Links with extra "\\\"

Posted: 28 Nov 2024, 20:58
by madnauseam
Hi,

While creating a custom template, I ended up with a lot of links that end with multiple "\\\\" characters.
I believe the correct link is still parsed as well, but the extra variants generate a lot of needless links.

Do you have any idea what could be causing this issue?

Thanks

EDIT: I should have added a bit more info.


Take this link (NSFW):

Code:

https://forum.candidgirls.io/t/fit-latina-in-onepiece-flex-gif/499923

In the Excluded links I have:

Code:

^https?://(www\.)?candidgirls\.io
https://forum.candidgirls.io/privacy
https://forum.candidgirls.io/login
https://forum.candidgirls.io/tos
https://forum.candidgirls.io/guidelines
https://forum.candidgirls.io/categories
https://forum.candidgirls.io/u
https://forum.candidgirls.io/tag
https://forum.candidgirls.io/letter_avatar_proxy
images/emoji/
https://forum.candidgirls.io/c

In the Included links I have:

Code:

https://forum.candidgirls.io/t/
https://forum.candidgirls.io/uploads/    (this one is where the attachments are)

I have played a bit with the Excluded page parts in order to cut out everything other than the image links, but I am still getting these extra links.

Once again, thank you.

Re: Links with extra "\\\"

Posted: 29 Nov 2024, 11:24
by Maksym
It looks like all the links with the extra "\" characters are located before the "</header>" tag, so adding one [ Excluded Page Part ] like this should do the job:

Code:

From:
To: </header>
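
For anyone wondering what such a page part does, here is a rough Python sketch of the idea. It is not the program's actual implementation: the `strip_page_part` helper, the rule that an empty From means "start of the page", and the `forum.example` URLs are all assumptions made for illustration.

```python
import re


def strip_page_part(html: str, start: str, end: str) -> str:
    """Remove the text from `start` through `end` (inclusive) from `html`.

    Assumption: an empty `start` means "from the beginning of the page".
    If either marker is missing, the page is returned unchanged.
    """
    begin = html.find(start) if start else 0
    if begin == -1:
        return html
    finish = html.find(end, begin + len(start))
    if finish == -1:
        return html
    return html[:begin] + html[finish + len(end):]


# Hypothetical page: the header holds a link with trailing backslashes,
# the body holds the clean version of the same link.
page = (
    '<header><a href="https://forum.example/t/topic/1\\\\">x</a></header>'
    '<main><a href="https://forum.example/t/topic/1">x</a></main>'
)

cleaned = strip_page_part(page, "", "</header>")
links = re.findall(r'href="([^"]+)"', cleaned)
print(links)
```

With the header cut out, only the clean link from the page body is left for the URL filters to look at.
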
Another note: why do you exclude www.candidgirls.io? There are no links to that website from the forum. I think the only Excluded URL you need (if you want to go that way) is

Code:

candidgirls\.io

And the only [Included URL] that you need is

Code:

/original/

I think you are overcomplicating things.
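
To see why the original www pattern never fires on forum links, the patterns can be tried out in Python. This is only a sketch under assumptions: it treats each entry as an unanchored regular expression, it assumes an Included match takes precedence over an Excluded one (which is how the advice above reads), and the attachment URL is a made-up example of a path containing /original/.

```python
import re

# Patterns quoted in the thread.
old_exclude = re.compile(r"^https?://(www\.)?candidgirls\.io")
new_exclude = re.compile(r"candidgirls\.io")
include = re.compile(r"/original/")

forum_page = "https://forum.candidgirls.io/login"
# Hypothetical attachment URL; the real upload path layout may differ.
image_url = "https://forum.candidgirls.io/uploads/default/original/1X/example.jpeg"


def is_kept(url: str) -> bool:
    """Assumed rule: an Included match wins, otherwise an Excluded match drops the link."""
    if include.search(url):
        return True
    return new_exclude.search(url) is None


# The old pattern expects the host to start with "candidgirls.io" or
# "www.candidgirls.io", so the "forum." subdomain is never matched.
print(old_exclude.search(forum_page))        # None
print(bool(new_exclude.search(forum_page)))  # True
print(is_kept(forum_page), is_kept(image_url))  # False True
```
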

Re: Links with extra "\\\"

Posted: 29 Nov 2024, 17:08
by madnauseam
Thanks for your input, Maksym!

The idea I had in mind was to crawl a category, for example:
and download each thread from that category.

Perhaps I'm not going in the right direction and there is an easier way to do that?

Re: Links with extra "\\\"

Posted: 30 Nov 2024, 11:51
by Maksym
You never mentioned you wanted the categories :( How did you plan to handle the pagination of the category pages?

Re: Links with extra "\\\"

Posted: 30 Nov 2024, 13:11
by madnauseam
So far I have been allowing links of the form ".../t/..." and using the "Excluded page parts" to remove the links to other threads, keeping only the &page= ones.

But yeah, I am a noob with this, so I'm sure there are better ways to do this.

Re: Links with extra "\\\"

Posted: 30 Nov 2024, 13:21
by Maksym
Yeah, that's the easiest way. Simply add

/t/

and

page=\d+

to the [ Included URLs ]
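
A quick way to check which links those two entries would let through is to test them as regular expressions in Python. This is a sketch only: it assumes each Included URL entry is matched anywhere in the link, and the thread and category URLs below are invented examples rather than real pages.

```python
import re

# The two Included URL entries suggested above.
included = [re.compile(r"/t/"), re.compile(r"page=\d+")]


def is_included(url: str) -> bool:
    """Return True if any Included pattern is found anywhere in the URL."""
    return any(pattern.search(url) for pattern in included)


candidates = [
    "https://forum.candidgirls.io/t/some-thread/123",       # hypothetical thread
    "https://forum.candidgirls.io/c/some-category?page=2",  # hypothetical paginated category page
    "https://forum.candidgirls.io/login",
    "https://forum.candidgirls.io/tag",
]

for url in candidates:
    print(is_included(url), url)
```

Thread links and numbered category pages pass, while pages such as /login or /tag do not, since neither contains "/t/" or a "page=" followed by digits.
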

Re: Links with extra "\\\"

Posted: 30 Nov 2024, 13:38
by madnauseam
Thank you for all your help!