Web Scraping by End Users

dc.contributor.authorAlex Tacuri
dc.contributor.authorSérgio Firmenich
dc.contributor.authorAlejandro Fernández
dc.contributor.authorMaría Florencia Riva
dc.contributor.authorMatías Urbieta
dc.contributor.authorGustavo Rossi
dc.coverage.spatialBolivia
dc.date.accessioned2026-03-22T19:50:57Z
dc.date.available2026-03-22T19:50:57Z
dc.date.issued2025
dc.description.abstractWeb scraping has been studied from various perspectives, including automatic and AI-based approaches, and is supported by a wide range of programming libraries that expedite development. As the volume of available web content increases, it becomes increasingly challenging to anticipate end-user requirements regarding what, how, and when to extract data from the web. This challenge is compounded when integrating data from multiple websites, particularly when websites’ search engines dynamically retrieve unavailable data via permanent links. Complex scraping processes such as these are difficult to develop using general-purpose programming languages and challenging to automate with AI-based approaches. Controllability is a crucial aspect of scraping: it concerns how end users can make decisions during the scraper specification process, understand the information sources, and see how the data are ultimately extracted, compiled, and formatted for output. In response, our study presents an innovative end-user approach for specifying scrapers that focuses on seamlessly integrating data from multiple sources. Through this approach and its supporting toolset, we aim to provide users with greater control and transparency over the extraction, integration, and formatting of data, thereby addressing key concerns in web scraping. The approach and toolset were evaluated and yielded promising results.
dc.identifier.doi10.1109/access.2025.3636662
dc.identifier.urihttps://doi.org/10.1109/access.2025.3636662
dc.identifier.urihttps://andeanlibrary.org/handle/123456789/78484
dc.language.isoen
dc.publisherInstitute of Electrical and Electronics Engineers
dc.relation.ispartofIEEE Access
dc.sourceUniversidad Nacional de La Plata
dc.subjectComputer science
dc.subjectDisk formatting
dc.subjectScraper site
dc.subjectKey (lock)
dc.subjectWorld Wide Web
dc.subjectEnd user
dc.subjectTransparency (behavior)
dc.subjectWeb page
dc.subjectMashup
dc.subjectWeb application
dc.titleWeb Scraping by End Users
dc.typearticle
