Scraping

What is Scraping?

Scraping is the process of downloading basic information about a file, such as the director, actors, description, and cover image, from relevant websites based on the file's name, code, or title.

Different file types in principle require different scraping solutions. PLM currently supports scraping for movies and Japanese AV videos, with limited support for books and music (through Douban).

Scraping works by searching the specified websites for the title or code of a file, parsing the resulting pages, and saving the extracted information in the file record. To prevent this kind of automated access, many websites, such as JavDB, restrict IP addresses that send queries too frequently. PLM is not designed to crawl these websites, but its behavior is similar. If you use PLM to send a large number of requests in a short period to a website with anti-scraping mechanisms, you may be temporarily blocked or even blacklisted. It is strongly recommended not to scrape too many records in one session or one day, and to consider using a VPN: after completing a batch, connect to a different VPN server to obtain a new IP address before starting the next batch.
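The batching advice above can be sketched as follows. This is only an illustration in Python, not PLM's own code: scrape_in_batches, scrape_one, between_batches, and the default batch size and delay are all hypothetical names and values chosen for the example.

```python
import time
from typing import Callable, Dict, Iterable, Iterator, List, Optional


def batches(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def scrape_in_batches(
    codes: Iterable[str],
    scrape_one: Callable[[str], object],       # hypothetical: scrapes one record
    batch_size: int = 20,                      # assumed value, tune per website
    delay_s: float = 3.0,                      # assumed pause between requests
    between_batches: Optional[Callable[[], None]] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, object]:
    """Scrape records in small batches, pausing between requests and batches."""
    results: Dict[str, object] = {}
    for batch_index, batch in enumerate(batches(codes, batch_size)):
        if batch_index > 0 and between_batches is not None:
            # For example: wait here until the user has switched VPN servers.
            between_batches()
        for position, code in enumerate(batch):
            if position > 0:
                sleep(delay_s)  # space out requests inside a batch
            results[code] = scrape_one(code)
    return results
```

Passing a function such as `lambda: input("Switch VPN server, then press Enter")` as between_batches would pause after each batch until the IP address has been changed by hand.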

Some websites (e.g., JavLibrary and JavBus) require users to manually verify their age or agree to terms during the first scraping session, or even for each batch (e.g., JavLibrary). To do this, click the "Website" button when selecting a scraper and complete the verification or consent step on the website. PLM will also attempt to detect automatically whether manual intervention is required.

Many downloaded movie filenames contain various tags, so it is recommended to run the AI title acquisition operation first to clean these filenames and obtain reasonable movie titles. When scraping, PLM uses the content of the title field first and the filename second.
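To illustrate why tagged filenames make poor search terms, here is a minimal Python sketch of a rule-based cleanup. It is a stand-in for illustration only and is not how PLM's AI title acquisition works; the tag list, clean_filename, and search_candidates are hypothetical.

```python
import re
from pathlib import PurePath
from typing import List, Optional

# Hypothetical list of release tags; real filenames use many more.
NOISE = re.compile(
    r"\b(2160p|1080p|720p|bluray|web-dl|x264|x265|hevc|aac)\b",
    re.IGNORECASE,
)


def clean_filename(filename: str) -> str:
    """Strip the extension, bracketed groups, and common release tags."""
    stem = PurePath(filename).stem
    stem = re.sub(r"\[[^\]]*\]", " ", stem)   # drop [group] style tags
    stem = re.sub(r"[._]+", " ", stem)        # dots and underscores to spaces
    stem = NOISE.sub(" ", stem)               # drop known release tags
    return re.sub(r"\s+", " ", stem).strip()


def search_candidates(title: Optional[str], filename: str) -> List[str]:
    """Return search terms in priority order: title field first, then filename."""
    candidates = []
    if title and title.strip():
        candidates.append(title.strip())
    cleaned = clean_filename(filename)
    if cleaned and cleaned not in candidates:
        candidates.append(cleaned)
    return candidates
```

For example, clean_filename("Movie.Name.2019.1080p.WEB-DL.x264.mkv") yields "Movie Name 2019", which is far more likely to match a website's search than the raw filename.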

After scraping is complete, if the scraper returned information in a language other than the one you want, it is recommended to edit the file and use the translate button to translate the description, actors, director, tags, etc.

When you start scraping, a dialog first lists the available scrapers for you to choose from. You can select several scrapers at once and use the Up/Down buttons to change their order. If the first scraper fails or finds nothing, PLM tries the other selected scrapers in order. If a profile exists, you can pick it in the lower right corner of the dialog to apply its selection quickly. The "Website" button opens the scraper's website, where you can also complete any required consent step.
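The fallback order described above can be pictured with the following Python sketch. It is an assumption about the general pattern, not PLM's implementation; scrape_with_fallback and the scraper callables are hypothetical.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

# A scraper takes a search term and returns the scraped fields,
# or None (or an empty dict) when it finds nothing.
Scraper = Callable[[str], Optional[Dict[str, str]]]


def scrape_with_fallback(
    query: str,
    scrapers: Sequence[Tuple[str, Scraper]],
) -> Optional[Tuple[str, Dict[str, str]]]:
    """Try each selected scraper in order and return the first usable result."""
    for name, scraper in scrapers:
        try:
            info = scraper(query)
        except Exception:
            # Treat any error as a failed scraper and move on to the next one.
            continue
        if info:
            return name, info
    return None  # every selected scraper failed or found nothing
```

The order of the scrapers list plays the role of the Up/Down buttons: whichever scraper comes first is tried first, and later ones are used only when earlier ones fail or return nothing.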

Users can write their own scripts for scraping. Refer to the built-in examples in $InstallationFolder\scraper\javdb.pas, javhub.pas, etc.

Which Scrapers are Supported?

Currently supported scrapers in PLM:

IMDB: A well-known movie information website. It can be used to scrape movies and provides information such as cover art, director, actors, synopsis, and ratings. Supports multiple languages.

Douban: A well-known Chinese website for movie, book, and music information and reviews. It can be used to scrape movies, books, and music, and provides information such as cover art, director, actors, synopsis, and ratings.

TheMovieDB: A well-known movie information website. It can be used to scrape movies and provides information such as cover art, director, actors, synopsis, and ratings.

JavDB: A large Japanese AV video information website. Its video information is relatively comprehensive, but it has anti-scraping mechanisms and watermarked images. Language support is limited to Traditional Chinese and English.

JavLibrary: A large Japanese AV video information website. It supports many languages (e.g., Japanese) and provides user comments, which may offer a way to download the video. The downside is that almost every scraping batch requires manual consent.

JavBus: A large Japanese AV video information website. It has a lot of information on older videos, is fast to access, and also supports Japanese and Korean. The downside is that first-time use (or use after clearing the browser cache) requires manual verification: the website shows a random Chinese driving test question, which you may need to look up online and must answer completely and correctly.

JavHub: This website mainly supports English/Japanese.

How to Re-Scrape

If a file has already been scraped, PLM skips it during scraping unless "Force Redo" is checked when selecting the "Scrape" action, or the existing data has been cleared with the "Clear Info" button while editing the file.
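The rule above amounts to a simple check, sketched here in Python. The record layout and the names should_scrape, clear_info, and scraped_info are hypothetical and only mirror the behavior described; they are not PLM's internal data model.

```python
from typing import Any, Dict


def should_scrape(record: Dict[str, Any], force_redo: bool = False) -> bool:
    """Scrape when forced, or when the record holds no scraped data yet."""
    already_scraped = bool(record.get("scraped_info"))
    return force_redo or not already_scraped


def clear_info(record: Dict[str, Any]) -> None:
    """Mimic "Clear Info": drop the scraped data so the file is scraped again."""
    record.pop("scraped_info", None)
```

Either route leads to the same result: checking "Force Redo" overrides the check once, while clearing the data makes the file look unscraped to every later scrape.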