A review on techniques for optimizing web crawler results


Reinforcement Learning in Deep Web Crawling: Survey

Kapil Madan & Rajesh Bhatia

Conference paper. First Online: 20 September 2021

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1374)
Context: Reinforcement learning (RL) can help solve several challenges of deep web crawling. Deep web content is reached by filling in search forms rather than by following hyperlinks, so understanding a search form and selecting suitable queries are necessary steps for retrieving deep web content, and crawling the deep web is therefore a very challenging task. RL-based techniques help fill in the search form and retrieve the deep web content successfully: RL selects an action for a given state, and the environment assigns a reward or penalty to the selected action. Objective: This study surveys RL-based techniques applied to deep web crawling. Method: The survey covers 31 articles, selected from 77 candidates published in reputed journals, conferences, and workshops. Results: Challenges arising at the various steps of deep web crawling are presented, along with the research papers that use RL-based techniques to address them. A comparative analysis of the RL techniques used in deep web crawling is given in terms of strengths, evaluation metrics, datasets, and research gaps. Conclusion: Several RL-based techniques that have not yet been explored could be applied to deep web crawling; open challenges and research directions are also recommended.
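The state/action/reward loop described in the abstract can be made concrete for query selection: the crawler's action is the query it submits to a search form, and the reward is the number of new records the response surfaces. Below is a minimal tabular sketch in the spirit of such techniques, not an implementation from any surveyed paper; the query pool, the simulated site (`RECORDS`), and all parameters are invented for illustration:

```python
import random

# Candidate queries for the search form, and a simulated deep web site
# (both invented): RECORDS[q] is how many records query q can surface.
QUERIES = ["a", "e", "data", "web", "zzz"]
RECORDS = {"a": 50, "e": 40, "data": 25, "web": 10, "zzz": 0}

def submit(query, seen):
    """Environment step: a submission pages through at most 10 new records."""
    remaining = max(0, RECORDS[query] - seen.get(query, 0))
    got = min(remaining, 10)
    seen[query] = seen.get(query, 0) + got
    return got  # reward = number of previously unseen records retrieved

def crawl(episodes=200, alpha=0.5, eps=0.2, seed=0):
    """Epsilon-greedy tabular value learning over the (single-state) query MDP."""
    rng = random.Random(seed)
    q = {a: 0.0 for a in QUERIES}   # value estimate per query (action)
    seen, total = {}, 0
    for _ in range(episodes):
        if rng.random() < eps:      # explore a random query
            action = rng.choice(QUERIES)
        else:                       # exploit current value estimates
            action = max(q, key=q.get)
        reward = submit(action, seen)
        total += reward
        q[action] += alpha * (reward - q[action])  # incremental value update
    return total, q

total, q = crawl()
```

Over repeated episodes the agent concentrates on queries that still return unseen records, which is the behaviour RL-based deep web crawlers exploit; an unproductive query such as `"zzz"` keeps a lower value estimate than a productive one.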



Author information

Kapil Madan & Rajesh Bhatia
Punjab Engineering College (Deemed to be University), Chandigarh, India

Editor information

Deepak Gupta, Department of Computer Science Engineering, Maharaja Agrasen Institute of Technology, Rohini, Delhi, India

Ashish Khanna, Maharaja Agrasen Institute of Technology, Rohini, Delhi, India

Vineet Kansal, Institute of Engineering and Technology, Lucknow, Uttar Pradesh, India

Giancarlo Fortino, University of Calabria, Rende, Cosenza, Italy

Aboul Ella Hassanien, Department of Information Technology, Cairo University, Giza, Egypt


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper.

Madan, K., Bhatia, R. (2022). Reinforcement Learning in Deep Web Crawling: Survey. In: Gupta, D., Khanna, A., Kansal, V., Fortino, G., Hassanien, A.E. (eds) Proceedings of Second Doctoral Symposium on Computational Intelligence . Advances in Intelligent Systems and Computing, vol 1374. Springer, Singapore. https://doi.org/10.1007/978-981-16-3346-1_24


DOI: https://doi.org/10.1007/978-981-16-3346-1_24

Published: 20 September 2021

Publisher Name: Springer, Singapore

Print ISBN: 978-981-16-3345-4

Online ISBN: 978-981-16-3346-1

eBook Packages: Intelligent Technologies and Robotics (R0)



Crawling the Dark Web: A Conceptual Perspective, Challenges and Implementation

Randa Basheer

2019, Journal of Digital Information Management

Internet and network technologies have evolved dramatically over the last two decades, with users increasingly demanding to preserve their identities and privacy. The approaches researchers developed to meet these demands gave rise to the largest part of the internet, the Deep Web. But while the Deep Web is a resort for many benign users who wish to preserve their privacy, it has also become the perfect ground for hosting illicit activities, producing the Dark Web. This creates a need for automated solutions that support law-enforcement and security agencies in collecting information from the Dark Web to disclose such activities. In this paper, we illustrate the concepts needed to develop a crawler that collects information from a dark website. We begin by discussing the three layers of the Internet, the characteristics of hidden and private networks, and the technical features of the Tor network. We also address the challenges facing a dark web crawler. Finally, we present our experimental system, which fetches data from a dark market. This approach helps put a single dark website under investigation, and can be a seed for future research and development.
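The single-site investigation described above reduces, at the crawling level, to a frontier policy that only enqueues links on the target onion host. The sketch below uses only the Python standard library and replaces the network fetch with an in-memory stub (the `.onion` host names and page contents are invented); a real implementation would route `fetch` through Tor's SOCKS proxy instead.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkParser(HTMLParser):
    """Collect href attributes from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Stand-in for fetching over Tor: a tiny in-memory "dark market" site.
PAGES = {
    "http://examplemarket.onion/": '<a href="/listings">listings</a>'
                                   '<a href="http://other.onion/">out</a>',
    "http://examplemarket.onion/listings": '<a href="/">home</a>'
                                           '<a href="https://clearnet.example.com/">leak</a>',
}

def fetch(url):
    return PAGES.get(url, "")

def crawl_site(seed):
    """BFS crawl whose frontier never leaves the seed's onion host."""
    host = urlparse(seed).netloc
    frontier, visited = deque([seed]), set()
    while frontier:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        parser = LinkParser()
        parser.feed(fetch(url))
        for link in parser.links:
            absolute = urljoin(url, link)
            # Frontier policy: only follow links on the investigated site.
            if urlparse(absolute).netloc == host and absolute not in visited:
                frontier.append(absolute)
    return visited

visited = crawl_site("http://examplemarket.onion/")
```

Links pointing to other onion services or to the clear web are discarded at enqueue time, which is what keeps the investigation scoped to a single dark website.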

Related Papers

European Review of Organised Crime (EROC)

Criminologists have traditionally used official records, interviews, surveys, and observation to gather data on offenders. Over the past two decades, more and more illegal activities have been conducted on or facilitated by the Internet. This shift towards the virtual is important for criminologists as traces of offenders’ activities can be accessed and monitored, given the right tools and techniques. This paper will discuss three techniques that can be used by criminologists looking to gather data on offenders who operate online: 1) mirroring, which takes a static image of an online resource like websites or forums; 2) monitoring, which involves an on-going observation of static and dynamic resources like websites and forums but also online marketplaces and chat rooms and; 3) leaks, which involve downloading of data placed online by offenders or left by them unwittingly. This paper will focus on how these tools can be developed by social scientists, drawing in part on our experience developing a tool to monitor online drug “cryptomarkets” like Silk Road and its successors. Special attention will be given to the challenges that researchers may face when developing their own custom tool, as well as the ethical considerations that arise from the automatic collection of data online.
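Of the three techniques above, mirroring captures one timestamped snapshot while monitoring repeatedly snapshots the same resource and compares consecutive captures. A small standard-library sketch of that comparison step, with invented marketplace listings standing in for real captures:

```python
import difflib

# Two consecutive monitoring snapshots of the same (invented) listing page.
snapshot_t1 = [
    "Listing: product A - 0.012 BTC",
    "Listing: product B - 0.030 BTC",
]
snapshot_t2 = [
    "Listing: product A - 0.015 BTC",   # price changed
    "Listing: product C - 0.020 BTC",   # product B replaced by C
]

def detect_changes(old, new):
    """Return the added/removed lines between two snapshots of a monitored page."""
    diff = difflib.unified_diff(old, new, lineterm="")
    return [line for line in diff
            if line[:1] in "+-" and not line.startswith(("+++", "---"))]

changes = detect_changes(snapshot_t1, snapshot_t2)
```

Each monitoring pass would store the raw snapshot (the mirror) alongside the diff, so that both the page's state and its evolution over time remain available for analysis.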


Pastrana, S., Thomas, D. R., Hutchings, A., & Clayton, R. (2018). CrimeBB: Enabling cybercrime research on underground forums at scale. Lyon: ACM International World Wide Web (WWW) Conference.

Alice Hutchings

Underground forums allow criminals to interact, exchange knowledge, and trade in products and services. They also provide a pathway into cybercrime, tempting the curious to join those already motivated to obtain easy money. Analysing these forums enables us to better understand the behaviours of offenders and pathways into crime. Prior research has been valuable, but limited by a reliance on datasets that are incomplete or outdated. More complete data, going back many years, allows for comprehensive research into the evolution of forums and their users. We describe CrimeBot, a crawler designed around the particular challenges of capturing data from underground forums. CrimeBot is used to update and maintain CrimeBB, a dataset of more than 48m posts made from 1m accounts in 4 different operational forums over a decade. This dataset presents a new opportunity for large-scale and longitudinal analysis using up-to-date information. We illustrate the potential by presenting a case study using CrimeBB, which analyses which activities lead new actors into engagement with cybercrime. CrimeBB is available to other academic researchers under a legal agreement, designed to prevent misuse and provide safeguards for ethical research.
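CrimeBot's internals are not detailed here, but one of the "particular challenges" of long-running forum crawls, endpoints that intermittently fail or throttle, is commonly handled with retries and exponential backoff. A generic sketch (the flaky endpoint is simulated; this is not CrimeBot's actual code):

```python
import time

def fetch_with_backoff(url, fetch, max_retries=4, base_delay=0.01):
    """Retry a flaky fetch, doubling the wait between attempts."""
    for attempt in range(max_retries):
        try:
            return fetch(url)
        except ConnectionError:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            time.sleep(base_delay * (2 ** attempt))  # 0.01s, 0.02s, 0.04s, ...

# Simulated forum endpoint that fails twice before succeeding.
calls = {"n": 0}
def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("timeout")
    return f"<html>page at {url}</html>"

page = fetch_with_backoff("https://forum.example/thread/1", flaky_fetch)
```

A production crawler would combine this with per-host rate limiting so that maintaining a dataset over a decade does not degrade the forums being observed.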


The World Wide Web (or simply the web) is a massive, rich, readily available, and convenient source of information, and its user base is growing very swiftly. To retrieve information from the web, users rely on search engines, which access web pages on their behalf. The web is very large and contains structured, semi-structured, and unstructured data; most of it is unmanaged, so the whole web cannot be covered in a single attempt. Search engines therefore use web crawlers. A web crawler is a vital part of a search engine: a program that navigates the web, downloads referenced web pages, loads their content into the search engine's database, and indexes it. The index is a huge database of the words and text that occur on different web pages. Search engines run several crawler instances on widely distributed servers to gather diversified information. This paper presents a systematic study of web crawlers; the study matters because properly designed web crawlers consistently yield good results.
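The index described above, "a huge database of words and text that occur on different webpages", is conventionally an inverted index mapping each term to the set of pages containing it. A minimal sketch over two invented pages:

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each lowercase term to the set of page URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term].add(url)
    return index

# Invented pages standing in for crawled content.
pages = {
    "http://example.com/a": "Web crawlers download web pages",
    "http://example.com/b": "Search engines index pages",
}
index = build_index(pages)
```

A query for a term then becomes a set lookup (and a multi-term query a set intersection), which is why search engines index at crawl time rather than scanning pages at query time.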


Balaji Narayanaswami

Web mining is a natural combination of two active research areas: data mining and the World Wide Web. It is commonly classified into three types: web content mining, web structure mining, and web usage mining. This paper focuses on some major aspects of web mining: link analysis, a data-analysis technique that evaluates relationships between nodes; web crawling, a system that visits websites and reads their pages and other information to create entries for a search engine index and to understand the structure of the web; and recommendation systems, which produce lists of recommendations from web usage patterns. A detailed study of these techniques, their algorithms, and their future scope is given.
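Link analysis, as described above, is classically instantiated by PageRank, which scores a page by the damped scores of the pages linking to it. A small power-iteration sketch on an invented four-page graph (the damping factor and iteration count are conventional defaults):

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration for PageRank on a dict of page -> list of outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share   # each outlink receives an equal share
            else:
                # Dangling node: spread its rank uniformly over all pages.
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

# Invented toy graph: every other page links to "hub", so it ranks highest.
graph = {
    "hub": ["a"],
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub", "a"],
}
rank = pagerank(graph)
```

The scores sum to 1 and can be read as the stationary probability that a random surfer, following links and occasionally teleporting, lands on each page.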

Jakob Demant, Gwern Branwen

The development of cryptomarkets has gained increasing attention from academics, including a growing scientific literature on the distribution of illegal goods through cryptomarkets. Dolliver's 2015 article "Evaluating drug trafficking on the Tor Network: Silk Road 2, the Sequel" addresses this theme by evaluating drug trafficking on one of the most well-known cryptomarkets, Silk Road 2.0. Research on cryptomarkets in general, and Dolliver's article in particular, raises a number of new methodological questions. This commentary is structured around a replication of Dolliver's original study, based not on Dolliver's original dataset but on a second dataset collected with the same methodology. We found that Dolliver's results differ greatly from those of our replicated study. While a margin of error is to be expected, the inconsistencies we found are too great to attribute to anything other than methodological issues. The analyses and conclusions drawn from studies using these methods are promising and insightful, but based on our replication we suggest that datasets be made available to other researchers, and that methodology and dataset metrics (e.g. number of downloaded pages, error logs) be described thoroughly in the context of webometrics and web crawling.




COMMENTS

  1. Analysis of Focused Web Crawlers: A Comparative Study

    This research paper presents a comparative study of focused web crawlers, specialized tools designed for targeted information retrieval. By conducting a systematic analysis, the study evaluates the performance and effectiveness of different crawlers. The research methodology involves selecting crawlers based on specific criteria and employing evaluation metrics. Multiple datasets are utilized ...

  2. A review on techniques for optimizing web crawler results

Nowadays the Internet is widely used by users to satisfy their information needs. With the exponential growth of the web, searching for useful information has become more difficult. A web crawler helps to extract the relevant and irrelevant links from the web, and various algorithms and techniques are used to optimize away the irrelevant links. Discovering information by using a web crawler has certain issues ...

  3. PDF A Cloud-based Web Crawler Architecture

A Cloud-based Web Crawler Architecture. Mehdi Bahrami, Mukesh Singhal, and Zixuan Zhuang, Cloud Lab, University of California, Merced, USA. Abstract: Web crawlers work on behalf of applications or services to find interesting and related information on the web.

  4. (PDF) Exploring Dark Web Crawlers: A Systematic ...


  5. PDF Design and Implementation of a High-Performance Distributed Web Crawler

Abstract: Broad web search engines as well as many more specialized search tools rely on web crawlers to acquire large collections of pages for indexing and analysis. Such a web crawler may interact with millions of hosts over a period of weeks or months, and thus issues of robustness, flexibility ...

  6. Experimental performance analysis of web crawlers using single and

The ultimate aim of this paper is to present the working of single- and multi-threaded web crawling and indexing algorithms using hierarchical clustering. The harvest rate is used to measure the harvesting capability of the web crawler; when a web page is crawled, the harvest rate for the crawler is computed automatically.

  7. Crawling the Dark Web: A Conceptual Perspective, Challenges and

The results of our hidden web mobile crawler are very promising: approximately 90% of the hidden web pages of a site can be downloaded automatically, which is otherwise a very difficult task.

  8. From Web Scraping to Web Crawling

A few past studies [19, 20, 21] deal with effective and scalable web crawlers. The paper starts with an explanation of four general methodologies behind web scraping tools and solutions. The later sections develop a common understanding of web crawling and the implementation of an application-based web crawler using the Scrapy framework.

  9. (PDF) Web Crawling Model and Architecture

Figure 1.8: The main data structures and the operation steps of the crawler: (1) the manager generates a batch of URLs, (2) the harvester downloads the pages, (3) the gatherer parses the pages ...

  10. [PDF] Exploring Dark Web Crawlers: A Systematic Literature Review of

A Tor-based web crawling model was developed within an existing software toolset customised for ACN-based investigations; it succeeded in scraping web content from both clear and dark web pages, and in scraping dark marketplaces on the Tor network. Strong encryption algorithms and reliable anonymity routing have made cybercrime investigation more challenging; hence, one option for law ...

  11. PDF Exploring Dark Web Crawlers: A systematic literature review of dark web

    The scientific contribution of this paper entails novel knowledge concerning ACN-based web crawlers. Furthermore, it presents a model for crawling and scraping clear and dark websites for the purpose

  12. Reinforcement Learning in Deep Web Crawling: Survey

Only one research paper on RL came from the focused crawling domain, and it did not implement the RL technique in the deep web domain. Kumar et al. presented a systematic literature review of web crawlers comprising 248 papers published up to 2014; it contained only two research papers related to the RL technique.

  13. Research on Web Data Mining Based on Topic Crawler

This paper analyzes methods of web data mining based on a topic crawler. It puts forward an architecture for web information search and data mining, and introduces the key technologies and operating principles of that architecture. After analyzing the functions and shortcomings of an ordinary crawler, the paper focuses on the working principle, implementation method, and ...

  14. (PDF) Web Crawler: A Review

In this paper, the applicability of web crawlers in the field of web search is discussed, along with a review of web crawlers applied to different problem domains in web search.

  15. Web scraping of research paper on IEEE Xplore website using

In the case of the second document, the abstract text is, for some reason, sprinkled with non-well-formed XML tags that do not render on the web page but do render in print. The abstract can still be extracted from there using a library such as lxml.

  16. (PDF) Summary of web crawler technology research

... important role in collecting network data. A web crawler is a computer program that traverses hyperlinks and indexes them. As the core part of the vertical search engine, how to make crawlers ...

