How to configure advanced regular expression

smith
Posts: 32
Joined: 07 Jan 2018, 09:51

Post by smith » 03 May 2023, 20:38

Hi maxim,
Thanks for the guidance regarding the exploration limit, but I just want to know about:
1. Advanced customized regular expressions with result entries
2. How to add a manual login regular expression with a result
3. How to search for and download images and videos with only a specified word, within a start page and so on


Thanking you

Maksym
Site Admin
Posts: 2085
Joined: 02 Mar 2009, 17:02

Re: How to configure advanced regular expression

Post by Maksym » 08 May 2023, 17:01

Well, I could use actual examples - I'll explain them to you. Let's take question 2: how to use addresses from the built-in browser.

First of all - you have to make sure that addresses that you want to use actually exist in the built-in browser. So, create a project with your Starting address, check the [ Manual login ] box and start the project. Let the page fully load in the built-in browser and then examine both [ JSON requests ] and [ Address list ] tabs of the built-in browser window.

I'll use a girlygirlpic.com downloader template I just created as an example. Please download it and open the template properties to see all the details I'm talking about here. This template had to download albums from different lists (like all albums of a selected model, or all albums from a selected tag). First of all I checked the page source, only to find out that no album links are there. Album addresses are loaded dynamically when you scroll the page down. I was unable to reproduce the requests this website uses for the infinite scroll, so I had to use the built-in browser window to harvest album addresses. Album addresses on this website look like this:

https://en.girlygirlpic.com/a/albumID

Now, if you are creating a one-time project and not a template that will work with many similar pages of the website - you can simply check the addresses you want to download right in the [ Address list ] tab, just as shown in this screenshot:

Image

But if you want to automate the process, you'll have to use a Regular Expression that will automatically "check" the addresses you need. Your Regular Expression has to be very specific, so that only a certain group of addresses is matched. I used this one:

Expression: ^([^\?]+girlygirlpic\.com/a/[^/\?#]+)$
Result: [#1]
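
To make the effect of this expression easier to see, here is a minimal Python sketch. It uses Python's `re` module as a stand-in, so it is only an approximation of how Extreme Picture Finder evaluates the expression. The helper name `apply_result` and the sample addresses (built from the album ID quoted later in this post, plus one made-up tag address) are assumptions for illustration only.

```python
import re

# The expression from the post. Python's `re` is used here as a stand-in
# engine; the program's own regex engine may differ in details.
ALBUM_RE = re.compile(r"^([^\?]+girlygirlpic\.com/a/[^/\?#]+)$")


def apply_result(url):
    """Return what the Result template [#1] would expand to, or None.

    [#1] stands for the first capture group, which here is the whole address.
    """
    match = ALBUM_RE.match(url)
    return match.group(1) if match else None


# Illustrative addresses only; the last one is a made-up non-album address.
candidates = [
    "https://en.girlygirlpic.com/a/r9195b3p83",      # plain album address
    "https://en.girlygirlpic.com/a/r9195b3p83/2",    # extra path segment
    "https://en.girlygirlpic.com/a/r9195b3p83?x=1",  # query string
    "https://en.girlygirlpic.com/t/sometag",         # not under /a/
]

for url in candidates:
    print(url, "->", apply_result(url))
```

Only the first address is matched in this sketch. The `[^/\?#]+` part stops at any slash, question mark or hash, which is what keeps the expression specific to album addresses.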

Image

You may have noticed that all addresses in the [ JSON requests ] tab are checked by default, so if you want to use any of them - you just have to make sure such addresses are allowed by the exploration limits. Addresses from the [ Address list ] tab have to be checked manually or with a Regular Expression, and they also must be within the exploration limits.

This is how you use addresses from the built-in browser.

Let's take the example with this template a little further. Having a list of album addresses is not enough to get photos from the albums on this website. Open the source of any album page on this website - you won't see any photo addresses. It means that the photo list is loaded dynamically once the album page is opened. Now, how do you know what requests are made from the page when it is opened? I use Chrome's [ Developer Tools ] for this. So, open the album page in a browser and press F12 to open the [ Developer Tools ]. Then select the [ Fetch/XHR ] tab to see only dynamically generated requests and reload the page in the browser. You should see something like this:

Image

Now you just have to click every address and select the [ Response ] tab to see if the selected request has links to photos. I found them here:

Image

Now you can switch to the [ Headers ] tab to see the request address and type.

Image

This request type is POST. It means that you need to know not only the request address, but also the POST Data or [ Payload ]. Luckily, it's not much here, just an album ID.

Image

Now you have all the information that is required to get a list of photos from any album on this website: you have to generate a POST request for every album ID you want to download. This can be done only with [ Custom Parsers ] in Extreme Picture Finder. The address of the request is the same for all albums, so the only information you will need to find is the album ID. [ Custom Parsers ] are applied to the source of the downloaded page and the result must contain a full URL (absolute or relative). So open the source of any album page in your browser and see if the album ID is mentioned anywhere on the page. The easiest way to find it is to press [ Ctrl + F ] and paste the album ID into the Search field. And this is what I found:

<input id=albumId type=hidden value=r9195b3p83>

This is exactly what I need. Now I just need to create a Regular Expression that will match only that album ID and nothing else. That's pretty easy:

<input id=albumId type=hidden value=([^>]+)>

You can check if your Regular Expression works right in the project properties, without running the project. There is a [ Test ] button under the Regular Expression list. Click it, paste the page source into the corresponding field and click the [ Test Regular Expressions ] button.
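
The same kind of check can also be done outside the program. The following is a rough Python sketch, assuming Python's `re` behaves like the program's engine for this simple expression. The page fragment is made up around the one line quoted above; the second `<input>` and its `value=zzz` are hypothetical, added only to show that nothing else is matched.

```python
import re

# The album ID expression from the post.
ID_RE = re.compile(r"<input id=albumId type=hidden value=([^>]+)>")

# A made-up page fragment; only the albumId line comes from the post.
page_source = """
<div class=album>
<input id=albumId type=hidden value=r9195b3p83>
<input id=other type=hidden value=zzz>
</div>
"""

# findall returns the content of the single capture group for every match.
album_ids = ID_RE.findall(page_source)
print(album_ids)  # ['r9195b3p83']
```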

So, the final resulting Custom Parser is:

Expression: <input id=albumId type=hidden value=([^>]+)>
Result: https://en.girlygirlpic.com/ax/
Request type: POST
POST data: {"album_id":"[#1]"}

Image
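
For readers who want to see what this Custom Parser amounts to, here is a hedged Python sketch that builds (but does not send) the equivalent POST request with the standard library. The function name `build_request` is hypothetical, and the `Content-Type` header is an assumption on my part; the real request may need other headers, which you can read from the [ Headers ] tab in the Developer Tools.

```python
import json
import re
import urllib.request

# Values taken from the Custom Parser above.
ID_RE = re.compile(r"<input id=albumId type=hidden value=([^>]+)>")
REQUEST_URL = "https://en.girlygirlpic.com/ax/"
POST_TEMPLATE = '{"album_id":"[#1]"}'


def build_request(page_source):
    """Build the POST request for the album found in page_source, or None."""
    match = ID_RE.search(page_source)
    if match is None:
        return None
    # [#1] is replaced with the first capture group, i.e. the album ID.
    body = POST_TEMPLATE.replace("[#1]", match.group(1))
    json.loads(body)  # raises ValueError if the body is not valid JSON
    return urllib.request.Request(
        REQUEST_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumption, not from the post
        method="POST",
    )


req = build_request("<input id=albumId type=hidden value=r9195b3p83>")
print(req.get_method(), req.full_url, req.data)
```

The request object is only constructed here. Passing it to `urllib.request.urlopen` would actually send it, which this sketch deliberately leaves out.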

Now you just have to make sure that URL from the [ Result ] field is within the exploration limits. I simply added:

^[^\?]+girlygirlpic\.com/ax/$

to the [ Included URLs ].
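
As a quick sanity check of such a limit, here is a small Python sketch. It assumes the included pattern names the girlygirlpic domain before `\.com/ax/`, and that an address is allowed when at least one included pattern matches it; the name `within_limits` is hypothetical, and the program's actual matching rules may differ.

```python
import re

# Assumed included pattern: the request address from the Custom Parser.
INCLUDED_URLS = [re.compile(r"^[^\?]+girlygirlpic\.com/ax/$")]


def within_limits(url):
    """Return True if at least one included pattern matches the address."""
    return any(pattern.search(url) for pattern in INCLUDED_URLS)


print(within_limits("https://en.girlygirlpic.com/ax/"))      # True
print(within_limits("https://en.girlygirlpic.com/ax/?x=1"))  # False
```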

That is the way I work with the dynamic requests generated "on the fly" by a website's JavaScript. Not rocket science, but pretty close to it. Feel free to ask me any questions about the above article.
