ghost | d186fff48f | skip curl download on response data size reached | 2023-05-09 10:21:37 +03:00
ghost | ef4de6b245 | fix image search page errors | 2023-05-09 08:53:33 +03:00
ghost | 23ead4e12c | update page / image description models, implement history snap crawling | 2023-05-09 08:19:49 +03:00
ghost | 0e9d29675f | implement host page description history crawling | 2023-05-09 01:29:32 +03:00
ghost | 32d0f390d3 | update http code and mime type on page/image ban event | 2023-05-08 14:13:53 +03:00
ghost | 8fbd7f3516 | count totals using sphinx index instead of database | 2023-05-08 12:28:49 +03:00
ghost | 25b6bce2ec | add crawler/cleaner logs | 2023-05-08 11:04:59 +03:00
ghost | ea04220de3 | add curl requests debug | 2023-05-08 08:27:21 +03:00
ghost | 6c41dd5831 | fix ban time update / count affected rows only | 2023-05-06 10:11:25 +03:00
ghost | b6605b9132 | implement not reachable resources ban feature with timeout to prevent extra http requests | 2023-05-06 08:45:37 +03:00
ghost | 702a14b634 | add mime content type crawling #1 | 2023-05-06 07:25:54 +03:00
ghost | f88d2ee9ff | implement MIME content-type crawler filter | 2023-05-05 21:25:57 +03:00
ghost | bed5d3f149 | fix offset out of bounds error | 2023-05-05 15:16:36 +03:00
ghost | 5999fb3a73 | add distributed hosts crawling using yggo nodes manifest | 2023-05-05 05:26:53 +03:00
ghost | f0b2eb1613 | show images total instead of pages in placeholder on image search page | 2023-05-05 01:42:44 +03:00
ghost | 297563d4a5 | display related pages in priority to the unique host by rank, rand() order | 2023-05-04 10:53:37 +03:00
ghost | 34b7291228 | add related to image hostpages limit | 2023-05-04 10:17:47 +03:00
ghost | adc791f378 | fix updateTime init | 2023-05-04 10:11:13 +03:00
ghost | d4f66c83e7 | fix image crawling errors | 2023-05-04 08:51:45 +03:00
ghost | baa8b0d2f0 | fix data type formatting | 2023-05-04 07:58:07 +03:00
ghost | 79878d17fe | add crawler / proxy user agent settings | 2023-05-04 07:38:22 +03:00
ghost | 73f212e3d7 | set crawler queue order priority to item rank, rand() | 2023-05-04 06:55:05 +03:00
ghost | 9ed8411d2f | add image queue crawler | 2023-05-04 06:45:04 +03:00
ghost | d905e33b4f | update host images info on search requests | 2023-05-04 06:12:51 +03:00
ghost | 68581960a3 | add image.data field | 2023-05-04 05:19:29 +03:00
ghost | 100d12c6ab | update curl library constructor | 2023-05-04 04:55:26 +03:00
ghost | 250e20bbcd | remove separator | 2023-05-04 04:19:38 +03:00
ghost | 6b18202588 | implement proxied image search #1 | 2023-05-04 03:48:57 +03:00
ghost | 0741a3e9ef | implement image crawler | 2023-05-04 01:04:39 +03:00
ghost | 6d8f4f4882 | create manifests registry | 2023-05-03 09:22:14 +03:00
ghost | 0bd765064b | implement extended search mode support #9 | 2023-05-01 20:09:28 +03:00
ghost | 84fd82f294 | fix replacement typo #9 | 2023-05-01 19:03:14 +03:00
ghost | d40b914983 | add new chars quoting #9 | 2023-05-01 18:58:03 +03:00
ghost | f7807cf43e | add extended syntax filter to prevent sphinxql query error #9 | 2023-05-01 18:39:46 +03:00
ghost | a5f5541395 | skip robots:noindex page without extra actions | 2023-04-29 08:58:48 +03:00
ghost | 11aa404807 | add metaYggo field index | 2023-04-25 21:10:59 +03:00
ghost | 8671fc4bde | implement page ranking | 2023-04-25 16:54:01 +03:00
ghost | fcee7f62ef | fix max_matches error | 2023-04-23 09:29:24 +03:00
ghost | 9916fb701f | implement basic api | 2023-04-23 03:01:51 +03:00
ghost | e6b1e8029c | add missed regex replacement rule | 2023-04-10 03:18:50 +03:00
ghost | 5c8d299a4a | add meta:robots tag support #2 | 2023-04-09 03:28:31 +03:00
ghost | 8e8d89db0e | implement database cleaner | 2023-04-09 00:06:28 +03:00
ghost | df6f2a1869 | implement CRAWL_ROBOTS_POSTFIX_RULES configuration #5 | 2023-04-08 22:28:31 +03:00
ghost | 2495a2bbc7 | implement MySQL/Sphinx data model #3, add basical robots.txt support #2 | 2023-04-07 04:04:24 +03:00
ghost | c9cd38f6ac | update variable names #2 | 2023-04-04 01:38:32 +03:00
ghost | ed2d4047b4 | implement robots.txt library #2 | 2023-04-04 00:27:32 +03:00
ghost | e7e4bb686c | fix curl exec double call | 2023-04-03 04:47:31 +03:00
ghost | ff95df72c1 | implement hostname identicons | 2023-04-03 01:30:09 +03:00
ghost | 4ea01bf8b4 | implement search results pagination | 2023-04-02 23:36:35 +03:00
ghost | 04dbbc3adf | make url/src column ukeys digital by using crc32 | 2023-04-02 18:56:56 +03:00