hancinema.net - dramas from selected company

kokocrunch
Posts: 9
Joined: 03 Jun 2022, 22:40

Post by kokocrunch »

Hi,
I'm trying to download from this parent page:

https://www.hancinema.net/korean_compan ... A.html?p=1 <- company category

The project should go to all the links under this category (drama tags only), for example
https://www.hancinema.net/korean_drama_ ... Studio.php
and download all the images in the picture gallery:
https://www.hancinema.net/korean_drama_ ... y.html?p=1

I can't attach the project file that I've come up with so far, but it is also picking up dramas from other companies.
Maksym
Site Admin
Posts: 2103
Joined: 02 Mar 2009, 17:02

Re: hancinema.net - dramas from selected company

Post by Maksym »

OK, here is how I work on such projects. First of all: [Site exploration] -> [Regular site], [Search for files within] -> [ Entire site ]. Other exploration types won't work. Since we want only specific pages of the website and the website has way more than we want - we should exclude all pages of the website using this filter in the [ Excluded URLs ]:

^https?://(www\.)?hancinema\.net

And now we can add filters to the [ Included URLs ] to allow the program to crawl only those pages that we want. First, we allow the software to crawl the "company pages" with this filter:

\?p=\d+$

Then we want it to go through the picture galleries, so we add this filter:

/korean_drama_[^/\?]+-picture_gallery\.html([^#]+)?$

Finally, we want it to allow the visits to the full-size photo pages to get the actual full-size photos:

-picture_\d+\.html

That's it. These 3 filters cover all the pages that have to be crawled.
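As a rough illustration of how these filters combine, here is a minimal Python sketch. It assumes that a match in [ Included URLs ] wins over [ Excluded URLs ] (as stated later in this thread) and that the filters behave like ordinary unanchored regular-expression searches. The function name is_crawled and all example addresses are made up for illustration; this is not the program's actual code.

```python
import re

# Filters copied from the post above.
EXCLUDED = [r"^https?://(www\.)?hancinema\.net"]
INCLUDED = [
    r"\?p=\d+$",                                              # company pages
    r"/korean_drama_[^/\?]+-picture_gallery\.html([^#]+)?$",  # picture galleries
    r"-picture_\d+\.html",                                    # full-size photo pages
]

def is_crawled(url: str) -> bool:
    """Hypothetical decision rule: included filters are checked first."""
    if any(re.search(p, url) for p in INCLUDED):
        return True
    if any(re.search(p, url) for p in EXCLUDED):
        return False
    return True  # assumption: an address matching no filter is allowed

# Made-up addresses, not real pages.
EXAMPLES = [
    "https://www.hancinema.net/korean_company_Example.html?p=1",
    "https://www.hancinema.net/korean_drama_Example_Show-picture_gallery.html?p=1",
    "https://www.hancinema.net/korean_drama_Example_Show-picture_123456.html",
    "https://www.hancinema.net/korean_movie_Something_Else.php",
]

for url in EXAMPLES:
    print(is_crawled(url), url)
```

Only the last address is rejected: it matches the exclude-everything filter and none of the included ones.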

And now the hardest part - we need to prevent Extreme Picture Finder from visiting other companies and picture galleries with addresses that match our filters. This is where [ Excluded Page Parts ] come into play. You need to open the HTML source of every "page type" that will be crawled. We have 3 page types in our project: company pages, picture gallery pages, and full-size photo pages. Every page should be left with the least amount of HTML text necessary to do the job. The rest has to be cut out with the [ Excluded Page Parts ].

I prefer to start with the pages that are closest to the files that we need to save. In this case - the full-size photo pages. This one is actually easy: the link to the full-size photo is located inside the only <picture>...</picture> tag, so we cut everything from the beginning of the HTML text to the first occurrence of the <picture text like this:

From: empty
To: <picture

And now we want to remove everything starting with the closing </picture tag to leave only the text between the <picture and </picture:

From: </picture
To: empty

Do not type the word "empty" into the field. Just leave the field blank.
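The net effect of these two page parts can be pictured with a small Python sketch. The helper keep_between and the sample HTML are hypothetical, and the sketch simply assumes that the opening marker stays in the remaining text; how the program treats the marker text itself is not described here.

```python
def keep_between(html: str, start_marker: str, end_marker: str) -> str:
    """Drop everything before start_marker and everything from end_marker on."""
    start = html.find(start_marker)
    if start == -1:
        return ""  # sketch choice: nothing is kept if the marker is missing
    end = html.find(end_marker, start)
    if end == -1:
        end = len(html)
    return html[start:end]

# Made-up page source for a full-size photo page.
page = (
    '<html><a href="/other.html">other</a>'
    '<picture><source srcset="/full/1.jpg"></picture>'
    '<a href="/more.html">more</a></html>'
)

trimmed = keep_between(page, "<picture", "</picture")
print(trimmed)
```

Links outside the <picture>...</picture> block are gone from the remaining text, so they cannot be followed.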

In the picture gallery pages, we want to leave only the part that has links to the full-size picture pages. It's located between these tags:

<main

and

<div id="resultmessage">

And for the company pages the content that we want lies between

<main

and

<div class="box article_side section_side">

We also don't need the <img...> tags because they contain pictures that we don't want, so we remove them with this [ Excluded Page Part ]:

From: <img
To: >
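Unlike the page parts above, this one can match many times on a page. A minimal Python sketch of the idea, using a regular expression as a stand-in for the From/To pair (the function name strip_img_tags and the sample HTML are made up, and the program may work differently inside):

```python
import re

# Everything from "<img" up to and including the next ">" is removed.
IMG_TAG = re.compile(r"<img[^>]*>")

def strip_img_tags(html: str) -> str:
    return IMG_TAG.sub("", html)

# Made-up fragment of a gallery page.
fragment = (
    '<a href="/korean_drama_Example_Show-picture_1.html">'
    '<img src="/thumbs/1.jpg"></a>'
    '<a href="/korean_drama_Example_Show-picture_2.html">'
    '<img src="/thumbs/2.jpg"></a>'
)

cleaned = strip_img_tags(fragment)
print(cleaned)
```

The links to the full-size photo pages survive, while the thumbnail addresses disappear from the text.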

Also, we don't want the mid-size photos, so I added this additional filter to the [ Excluded URLs ]:

/photos/photo\d+

I also added the sub-folder creation. Photos from different picture galleries will be saved into separate sub-folders created with this Regular Expression:

Expression: /([^/\?]+)-picture_
Result: [#1]
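For anyone who wants to check what this expression captures, here is a short Python sketch. It assumes that [#1] means the first capture group and that the expression is applied to an address containing "-picture_"; the function name subfolder_for and the example addresses are made up.

```python
import re

# Expression copied from the post above.
FOLDER_RE = re.compile(r"/([^/\?]+)-picture_")

def subfolder_for(url: str):
    """Return the text of capture group 1, or None if there is no match."""
    match = FOLDER_RE.search(url)
    return match.group(1) if match else None

# Made-up addresses, not real pages.
photo_page = "https://www.hancinema.net/korean_drama_Example_Show-picture_123456.html"
gallery_page = "https://www.hancinema.net/korean_drama_Example_Show-picture_gallery.html?p=2"

print(subfolder_for(photo_page))
print(subfolder_for(gallery_page))
```

Both addresses give the same group, so photos from one gallery end up in one sub-folder.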

That's it. That's all the settings. Here is a project file with all of the above settings and your Starting address:

hancinema.net - korean company downloader

Right-click the above link and select [ Save link as... ]. Then double-click the saved file to add it to Extreme Picture Finder and start the project in the program.
kokocrunch
Posts: 9
Joined: 03 Jun 2022, 22:40

Re: hancinema.net - dramas from selected company

Post by kokocrunch »

Hi,

Thank you so much for this template,

but I'm getting these links repeatedly.

This link is created at depth 7, it generates another link at depth 8, and so on. The project is still running and has now reached depth 500:

https://www.hancinema.net/korean_drama_ ... &p=2#login

I tried to exclude it by using these filters:
/korean_drama_[^/\?]+-picture_gallery\.html-d+
/korean_drama_[^/\?]+-picture_gallery\.html-([^#]+)?$

I also tried limiting the crawl depth to 10, but then it doesn't get all the page links of the show.
kokocrunch
Posts: 9
Joined: 03 Jun 2022, 22:40

Re: hancinema.net - dramas from selected company

Post by kokocrunch »

Hi,

This is what I've noticed so far on this page: for certain shows it creates multiple pages like the ones below, so the crawl goes on to depth 500 and beyond. These links don't follow the normal link structure:

https://www.hancinema.net/korean_drama_ ... orah&&&p=1
https://www.hancinema.net/korean_drama_ ... &p=2#login

but the same gallery also works with this link, which has the normal link structure:

https://www.hancinema.net/korean_drama_ ... y.html?p=1

I tried to exclude the links above with different expressions, but no luck: the program still creates and crawls multiple links.

picture_gallery\.html?-_[^/\?]
picture_gallery\.html?-_\d+
picture_gallery\.html?-_([^#]+)?$
picture_gallery\.html?-_([^/]+)
hancinema\.net/korean_drama_[^/\?]+-picture_gallery.html?-_[^/\?]+&&-_[^/\?]+p=\d+
hancinema\.net/korean_drama_\d+-picture_gallery.html?-_\d+&&-_\d+p=([^#]+)?$
hancinema\.net/korean_drama_\d+-picture_gallery.html?-_([^#]+)?$&&-_([^#]+)?$p=([^#]+)?$
hancinema\.net/korean_drama_\d+-picture_gallery.html?[^/\?]#login([^#]+)?$

Here are some more shows that also crawl into multiple links:
https://www.hancinema.net/korean_drama_ ... y.html?p=1
https://www.hancinema.net/korean_drama_ ... y.html?p=1
Maksym
Site Admin
Posts: 2103
Joined: 02 Mar 2009, 17:02

Re: hancinema.net - dramas from selected company

Post by Maksym »

[ Included URLs ] are prioritized over the [ Excluded URLs ]. Otherwise, you couldn't have used

^https?://(www\.)?hancinema\.net

in the [ Excluded URLs ] and still have any pages on the hancinema.net website crawled. And since you have

/korean_drama_[^/\?]+-picture_gallery\.html([^#]+)?$

in the [ Included URLs ], adding something like

picture_gallery\.html?-_[^/\?]

to the [ Excluded URLs ] will not have any effect because you already allowed such addresses. Thus, you need to modify the filter that allows crawling of pages you don't want. I think this one will do:

/korean_drama_[^/\?]+-picture_gallery\.html(\?p=\d+)?$
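To see the difference between the two versions of the filter, here is a small Python check. The addresses are made up: the real looping addresses are shortened in this thread, so the extra query text below is only a stand-in for whatever the site appends. The sketch also assumes the filters behave like ordinary regular-expression searches.

```python
import re

# Both filters are copied from this thread.
OLD_FILTER = re.compile(r"/korean_drama_[^/\?]+-picture_gallery\.html([^#]+)?$")
NEW_FILTER = re.compile(r"/korean_drama_[^/\?]+-picture_gallery\.html(\?p=\d+)?$")

BASE = "https://www.hancinema.net/korean_drama_Example_Show-picture_gallery.html"

# Made-up addresses, not real pages.
normal_first = BASE             # gallery without a page number
normal_paged = BASE + "?p=2"    # normal link structure
looping = BASE + "?extra=1&&&p=2"  # stand-in for the repeating links

for url in (normal_first, normal_paged, looping):
    old = bool(OLD_FILTER.search(url))
    new = bool(NEW_FILTER.search(url))
    print(f"old={old} new={new} {url}")
```

The old filter accepts any text after ".html", including the repeating links. The new one accepts only an optional "?p=" followed by digits, so the normal gallery pages are still crawled and the repeating links are no longer allowed by this filter.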