OK, here is how I work on such projects. First of all:
[Site exploration] -> [Regular site],
[Search for files within] ->
[ Entire site ]. Other exploration types won't work. We want only specific pages, and the website has far more than that, so we first exclude every page of the website with this filter in the
[ Excluded URLs ]:
^https?://(www\.)?hancinema\.net
And now we can add filters to the
[ Included URLs ] to allow the program to crawl only those pages that we want. First, we allow the software to crawl the "company pages" with this filter:
\?p=\d+$
Then we want it to go through the picture galleries, so we add this filter:
/korean_drama_[^/\?]+-picture_gallery\.html([^#]+)?$
Finally, we allow it to visit the full-size photo pages, which is where the actual full-size photos are linked:
-picture_\d+\.html
That's it. These 3 filters cover all the pages that have to be crawled.
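If you want to check the filters before running the project, here is a small Python sketch. The regular expressions are copied from above, but the sample addresses are made up for illustration, and the rule that an Included match wins over an Excluded match is my assumption about how the two lists combine, not something taken from the program itself.

```python
import re

# Filters copied from the post.
EXCLUDED = [r"^https?://(www\.)?hancinema\.net"]
INCLUDED = [
    r"\?p=\d+$",
    r"/korean_drama_[^/\?]+-picture_gallery\.html([^#]+)?$",
    r"-picture_\d+\.html",
]


def should_crawl(url: str) -> bool:
    """Assumption: an Included match overrides an Excluded match."""
    if any(re.search(pattern, url) for pattern in INCLUDED):
        return True
    return not any(re.search(pattern, url) for pattern in EXCLUDED)


# Hypothetical addresses, for illustration only.
SAMPLES = [
    "https://www.hancinema.net/example_company.php?p=2",
    "https://www.hancinema.net/korean_drama_Example-picture_gallery.html",
    "https://www.hancinema.net/korean_drama_Example-picture_12345.html",
    "https://www.hancinema.net/news.html",
]

for url in SAMPLES:
    print("crawl" if should_crawl(url) else "skip ", url)
```

Only the last sample address should be skipped; the first three each match one of the Included filters.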
And now the hardest part - we need to prevent Extreme Picture Finder from visiting other companies and picture galleries with addresses that match our filters. This is where
[ Excluded Page Parts ] come into play. You need to open the HTML source of every "page type" that will be crawled. We have 3 page types in our project: company pages, picture gallery pages, and full-size photo pages. Each page type should be cut down to the smallest piece of HTML that still contains the links we need. Everything else has to be cut out with the
[ Excluded Page Parts ].
I prefer to start with the pages that are closest to the files that we need to save. In this case - the full-size photo pages. This one is actually easy: the link to the full-size photo is located inside the only
<picture>...</picture> tag, so we cut everything from the beginning of the HTML text to the first occurrence of the
<picture text like this:
From:
empty
To:
<picture
And now we want to remove everything starting with the closing
</picture tag to leave only the text between the
<picture and
</picture:
From:
</picture
To:
empty
Do not actually type the word
empty into the field. Just leave the field blank.
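To see what these two cuts leave behind, here is a rough Python sketch. The function cut_part and the sample HTML are hypothetical, and I am assuming that a cut removes the From and To markers together with everything between them. The program's exact behaviour may differ in that detail.

```python
def cut_part(html: str, start: str, end: str) -> str:
    """Remove text from `start` through `end`.

    An empty `start` means "from the beginning of the page" and an empty
    `end` means "to the end of the page". Assumption: the markers
    themselves are removed as well.
    """
    kept = []
    pos = 0
    while True:
        begin = pos if start == "" else html.find(start, pos)
        if begin == -1:
            break
        if end == "":
            finish = len(html)
        else:
            found = html.find(end, begin + len(start))
            if found == -1:
                break
            finish = found + len(end)
        kept.append(html[pos:begin])
        pos = finish
        if start == "" or end == "":
            break  # an open-ended cut can only apply once
    kept.append(html[pos:])
    return "".join(kept)


# Hypothetical full-size photo page.
page = (
    '<html><body><nav><a href="/other.html">x</a></nav>'
    '<picture><source srcset="/full/1.jpg"></picture>'
    '<footer><a href="/more.html">y</a></footer></body></html>'
)

page = cut_part(page, "", "<picture")   # From: empty, To: <picture
page = cut_part(page, "</picture", "")  # From: </picture, To: empty
print(page)
```

The links in the navigation and the footer are gone, and only the inside of the picture element is left for the program to follow.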
In the picture gallery pages, we want to leave only the part that has links to the full-size picture pages. It's located between these tags:
<main
and
<div id="resultmessage">
And for the company pages, the content that we want lies between
<main
and
<div class="box article_side section_side">
We also don't need the
<img...> tags because they contain pictures that we don't want, so we remove them with this
[ Excluded Page Part ]:
From:
<img
To:
>
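Here is a short Python sketch of the picture gallery trimming: keep only the part between the two markers, then drop the img tags. The helper names and the sample HTML are made up for illustration, and this only imitates the effect of the [ Excluded Page Parts ] settings, it is not how the program does it internally.

```python
import re


def keep_between(html: str, start: str, end: str) -> str:
    """Keep the text from the first `start` up to (not including) the next `end`."""
    begin = html.find(start)
    if begin == -1:
        return html
    finish = html.find(end, begin)
    return html[begin:] if finish == -1 else html[begin:finish]


def strip_img_tags(html: str) -> str:
    """Remove every <img ...> tag, like the From: <img / To: > page part."""
    return re.sub(r"<img[^>]*>", "", html)


# Hypothetical picture gallery page.
gallery = (
    '<header><a href="/korean_drama_Other-picture_gallery.html">other</a></header>'
    '<main><a href="/korean_drama_Example-picture_111.html">'
    '<img src="/thumb/111.jpg"></a>'
    '<div id="resultmessage"></div><a href="/unrelated.html">more</a></main>'
)

trimmed = strip_img_tags(keep_between(gallery, "<main", '<div id="resultmessage">'))
print(trimmed)
```

For the company pages the same idea applies, with the second marker replaced by the div that has the class "box article_side section_side".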
Also, we don't want the mid-size photos, so I added one more filter to the
[ Excluded URLs ]:
/photos/photo\d+
I also set up sub-folder creation. Photos from different picture galleries will be saved into separate sub-folders named by this regular expression:
Expression:
/([^/\?]+)-picture_
Result:
[#1]
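To show what [#1] ends up holding, here is a tiny Python sketch that applies the same expression and takes the first captured group. The sample address is hypothetical, and which address the program actually tests the expression against is an assumption on my part.

```python
import re
from typing import Optional

# Expression copied from the post; [#1] corresponds to the first group.
SUBFOLDER = re.compile(r"/([^/\?]+)-picture_")


def subfolder_for(url: str) -> Optional[str]:
    """Return the sub-folder name for an address, or None if it does not match."""
    match = SUBFOLDER.search(url)
    return match.group(1) if match else None


# Hypothetical full-size photo page address.
print(subfolder_for("https://www.hancinema.net/korean_drama_Example-picture_12345.html"))
```

Every photo reached through the same gallery gets the same name, so the photos of one drama land in one sub-folder.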
Those are all the settings. Here is a project file with all of the above settings and your
Starting address:
hancinema.net - korean company downloader
Right-click the above link and select
[ Save link as... ]. Then double-click the saved file to add it to Extreme Picture Finder and start the project in the program.