Compare commits: 4e868ca8e9...main (14 commits)
| SHA1 |
|---|
| 1588f83624 |
| c40d44e4c9 |
| e6b75410ed |
| f7be09ec63 |
| 9e61572747 |
| 11fd8fa673 |
| cbf099701b |
| ad274c0738 |
| de4d2e53d9 |
| daa76b2141 |
| 09d5b61779 |
| e569d189d5 |
| b5640ad131 |
| 5b5a627662 |
**.gitignore** (vendored, 1 change)
```diff
@@ -23,3 +23,4 @@ Thumbs.db

 # Docker
 docker-compose.override.yml
+/.claude/settings.local.json
```
**AGENTS.md** (100 changes)
@@ -4,13 +4,109 @@

The codebase is intentionally lean. `index.php` bootstraps the crawl by instantiating `webanalyse` and handing off the crawl identifier. Core crawling logic lives in `webanalyse.php`, which houses HTTP fetching, link extraction, and database persistence. Use `setnew.php` to reset seed data inside the `screaming_frog` schema before a rerun. Keep new helpers in their own PHP files under this root so the autoload includes stay predictable; group SQL migrations or fixtures under a `database/` folder if you add them. IDE settings reside in `.idea/`.

## Build, Test, and Development Commands

Run the project through Apache in XAMPP or start the PHP built-in server with `php -S localhost:8080 index.php` from this directory. Validate syntax quickly via `php -l webanalyse.php` (repeat for any new file). When iterating on crawl logic, truncate runtime tables with `php setnew.php` to restore the baseline dataset.
### Docker Development

The project runs in Docker containers. Use these commands:

```bash
# Start containers
docker-compose up -d

# Stop containers
docker-compose down

# Rebuild containers
docker-compose up -d --build

# View logs
docker-compose logs -f php
```
### Running Tests

The project uses PHPUnit for automated testing:

```bash
# Run all tests (Unit + Integration)
docker-compose exec php sh -c "php /var/www/html/vendor/bin/phpunit /var/www/tests/"

# Or use the composer shortcut
docker-compose exec php composer test
```

**Test Structure:**

- `tests/Unit/` - Unit tests for individual components
- `tests/Integration/` - Integration tests for full crawl workflows
- All tests run in isolated database transactions
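The transaction isolation mentioned above can be sketched as a PHPUnit base class. The class below is a hypothetical illustration of the pattern, not code taken from this repository: each test runs inside a transaction that is rolled back afterwards, so writes never persist between tests.

```php
<?php

declare(strict_types=1);

use PHPUnit\Framework\TestCase;

// Hypothetical sketch: wrap every test in a PDO transaction and
// roll it back in tearDown() so the database stays untouched.
abstract class DatabaseTestCase extends TestCase
{
    protected \PDO $db;

    protected function setUp(): void
    {
        $this->db = \App\Database::getInstance();
        $this->db->beginTransaction();   // open a transaction per test
    }

    protected function tearDown(): void
    {
        $this->db->rollBack();           // discard all writes the test made
    }
}
```

Concrete test classes would extend this base class instead of `TestCase` directly.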
### Static Code Analysis

PHPStan is configured at Level 8 (strictest) to ensure type safety:

```bash
# Run PHPStan analysis
docker-compose exec php sh -c "php -d memory_limit=512M /var/www/html/vendor/bin/phpstan analyse -c /var/www/phpstan.neon"

# Or use the composer shortcut
docker-compose exec php composer phpstan
```

**PHPStan Configuration:**

- Level: 8 (maximum strictness)
- Analyzes: `src/` and `tests/`
- Excludes: `vendor/`
- Config file: `phpstan.neon`

All code must pass PHPStan Level 8 with zero errors before merging.
### Code Style Checking

PHP_CodeSniffer enforces PSR-12 coding standards:

```bash
# Check code style
docker-compose exec php composer phpcs

# Automatically fix code style issues
docker-compose exec php composer phpcbf
```

**PHPCS Configuration:**

- Standard: PSR-12
- Analyzes: `src/` and `tests/`
- Excludes: `vendor/`
- Auto-fix available via `phpcbf`

Run `phpcbf` before committing to automatically fix most style violations.
## Coding Style & Naming Conventions

Follow PSR-12 style cues already in use: 4-space indentation, brace-on-new-line for functions, and `declare(strict_types=1);` at the top of entry scripts. Favour descriptive camelCase for methods (`getMultipleWebsites`) and snake_case only for direct SQL field names. Maintain `mysqli` usage for consistency, and gate new configuration through constants or clearly named environment variables.
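As a quick illustration of those conventions, a minimal file might look like this. The class name and method body are invented for illustration; only the stylistic rules (strict types, 4-space indentation, brace on its own line, camelCase methods, snake_case for SQL-mirroring names) come from the guidelines above.

```php
<?php

declare(strict_types=1);

// Hypothetical example class, not part of the repository.
class ExampleAnalyser
{
    // camelCase method name; brace on a new line per PSR-12.
    public function getMultipleWebsites(int $crawl_id): array
    {
        // snake_case is reserved for names that mirror SQL fields,
        // such as the crawl_id column referenced here.
        return [];
    }
}
```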
## Testing Guidelines

There is no automated suite yet; treat each crawl as an integration test. After code changes, run `php setnew.php` followed by a crawl and confirm that `crawl`, `urls`, and `links` tables reflect the expected row counts. Log anomalies with `error_log()` while developing, and remove or downgrade to structured responses before merging.
### Automated Testing

The project has a comprehensive test suite using PHPUnit:

- **Write tests first**: Follow TDD principles when adding new features
- **Unit tests** (`tests/Unit/`): Test individual classes and methods in isolation
- **Integration tests** (`tests/Integration/`): Test full crawl workflows with real HTTP requests
- **Database isolation**: Tests use transactions that roll back automatically
- **Coverage**: Aim for high test coverage on critical crawl logic
### Quality Gates

Before committing code, ensure:

1. All tests pass: `docker-compose exec php composer test`
2. PHPStan analysis passes: `docker-compose exec php composer phpstan`
3. Code style is correct: `docker-compose exec php composer phpcs`
4. Auto-fix style issues: `docker-compose exec php composer phpcbf`

**Pre-commit Checklist:**

- ✅ Tests pass
- ✅ PHPStan Level 8 with 0 errors
- ✅ PHPCS PSR-12 compliance (warnings acceptable)
### Manual Testing

For UI changes, manually test the crawler interface at http://localhost:8080. Verify:

- Job creation and status updates
- Page and link extraction accuracy
- Error handling for invalid URLs or network issues
## Commit & Pull Request Guidelines

Author commit messages in the present tense with a concise summary (`Add link grouping for external URLs`). Group related SQL adjustments with their PHP changes in the same commit. For pull requests, include: a short context paragraph, reproduction steps, screenshots of key output tables when behaviour changes, and any follow-up tasks. Link tracking tickets or issues so downstream agents can trace decisions.
**README.md** (89 changes)
@@ -1,7 +1,17 @@

```diff
-# PHP Docker Anwendung
+# Web Crawler
```

A PHP application with MariaDB, running in Docker.

## Copyright & License

**Copyright © 2025 Martin Kiesewetter**

- **Author:** Martin Kiesewetter
- **E-Mail:** mki@kies-media.de
- **Website:** [https://kies-media.de](https://kies-media.de)

---

## Requirements

- Docker
@@ -43,16 +53,79 @@ docker-compose up -d --build

```diff
 .
 ├── docker-compose.yml          # Docker Compose configuration
 ├── Dockerfile                  # PHP container image
-├── start.sh                    # Container start script
-├── init.sql                    # Database initialization
-├── config/
+├── config/                     # Configuration files
+│   ├── docker/
+│   │   ├── init.sql            # Database initialization
+│   │   └── start.sh            # Container start script (unused)
 │   └── nginx/
 │       └── default.conf        # Nginx configuration
-└── src/
-    └── index.php               # Main application
+├── src/                        # Application code
+│   ├── api.php
+│   ├── index.php
+│   ├── classes/
+│   └── crawler-worker.php
+├── tests/                      # Test suite
+│   ├── Unit/
+│   └── Integration/
+├── phpstan.neon                # PHPStan configuration
+└── phpcs.xml                   # PHPCS configuration
```
## Development

The application files live in the `src/` directory and are mounted into the container as a volume, so changes are visible immediately.

## Tests & Code Quality

### Running Unit Tests

The application uses PHPUnit for unit and integration tests:

```bash
# Run all tests
docker-compose exec php sh -c "php /var/www/html/vendor/bin/phpunit /var/www/tests/"

# Alternative via Composer script
docker-compose exec php composer test
```

The tests are located in:

- `tests/Unit/` - unit tests
- `tests/Integration/` - integration tests
### Static Code Analysis with PHPStan

PHPStan is configured at level 8 (the strictest level) and analyses the entire codebase:

```bash
# Run PHPStan
docker-compose exec php sh -c "php -d memory_limit=512M /var/www/html/vendor/bin/phpstan analyse -c /var/www/phpstan.neon"

# Alternative via Composer script
docker-compose exec php composer phpstan
```

**PHPStan configuration:**

- Level: 8 (strictest)
- Analysed paths: `src/` and `tests/`
- Excluded: the `vendor/` directory
- Config file: `phpstan.neon`
### Code Style Checks with PHP_CodeSniffer

PHP_CodeSniffer (PHPCS) checks the code against the PSR-12 standard:

```bash
# Check code style
docker-compose exec php composer phpcs

# Automatically fix code style issues
docker-compose exec php composer phpcbf
```

**PHPCS configuration:**

- Standard: PSR-12
- Analysed paths: `src/` and `tests/`
- Excluded: the `vendor/` directory
- Auto-fix available via `phpcbf`
```diff
@@ -1,4 +1,5 @@
 {
+    "_comment": "Web Crawler - Composer Configuration | Copyright (c) 2025 Martin Kiesewetter <mki@kies-media.de> | https://kies-media.de",
     "name": "web-crawler/app",
     "description": "Web Crawler Application with Parallel Processing",
     "type": "project",
```
```diff
@@ -1,3 +1,11 @@
+/**
+ * Web Crawler - Database Schema
+ *
+ * @copyright Copyright (c) 2025 Martin Kiesewetter
+ * @author Martin Kiesewetter <mki@kies-media.de>
+ * @link https://kies-media.de
+ */
+
 -- Database initialization script for Web Crawler

 -- Crawl Jobs Table
```
@@ -20,12 +28,17 @@ CREATE TABLE IF NOT EXISTS pages (

```sql
    crawl_job_id INT NOT NULL,
    url VARCHAR(2048) NOT NULL,
    title VARCHAR(500),
    meta_description TEXT,
    status_code INT,
    content_type VARCHAR(100),
    redirect_url VARCHAR(2048),
    redirect_count INT DEFAULT 0,
    crawled_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (crawl_job_id) REFERENCES crawl_jobs(id) ON DELETE CASCADE,
    INDEX idx_crawl_job (crawl_job_id),
    INDEX idx_url (url(255)),
    INDEX idx_status_code (status_code),
    INDEX idx_redirect_count (redirect_count),
    UNIQUE KEY unique_job_url (crawl_job_id, url(255))
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
```
```diff
@@ -1,3 +1,9 @@
+# Web Crawler - Docker Compose Configuration
+#
+# @copyright Copyright (c) 2025 Martin Kiesewetter
+# @author Martin Kiesewetter <mki@kies-media.de>
+# @link https://kies-media.de
+
 version: '3.8'

 services:
@@ -10,6 +16,11 @@ services:
       - "8080:80"
     volumes:
       - ./src:/var/www/html
+      - ./tests:/var/www/tests
+      - ./composer.json:/var/www/composer.json
+      - ./composer.lock:/var/www/composer.lock
+      - ./phpstan.neon:/var/www/phpstan.neon
+      - ./phpcs.xml:/var/www/phpcs.xml
       - ./config/nginx/default.conf:/etc/nginx/conf.d/default.conf
     depends_on:
       - mariadb
@@ -29,7 +40,7 @@ services:
       - "3307:3306"
     volumes:
       - mariadb_data:/var/lib/mysql
-      - ./init.sql:/docker-entrypoint-initdb.d/init.sql
+      - ./config/docker/init.sql:/docker-entrypoint-initdb.d/init.sql
     networks:
       - app-network
```
**index.php** (11 changes, file deleted)
@@ -1,11 +0,0 @@

```php
<?php
declare(strict_types=1);

error_reporting(E_ALL);
ini_set('display_errors', '1');

require_once 'webanalyse.php';
$wa = new WebAnalyse();
$db = mysqli_connect('localhost', 'root', '', 'screaming_frog');

$wa->doCrawl(1);
```
**phpcs.xml** (new file, 19 lines)
@@ -0,0 +1,19 @@

```xml
<?xml version="1.0"?>
<ruleset name="ScreamingFrog">
    <description>PHP_CodeSniffer configuration</description>

    <!-- Use PSR-12 coding standard -->
    <rule ref="PSR12"/>

    <!-- Paths to check -->
    <file>/var/www/html</file>
    <file>/var/www/tests</file>

    <!-- Exclude vendor directory -->
    <exclude-pattern>/var/www/html/vendor/*</exclude-pattern>
    <exclude-pattern>*/vendor/*</exclude-pattern>

    <!-- Show progress and colors -->
    <arg name="colors"/>
    <arg value="sp"/>
</ruleset>
```
**phpstan.neon** (new file, 7 lines)
@@ -0,0 +1,7 @@

```neon
parameters:
    level: 8
    paths:
        - /var/www/html
        - /var/www/tests
    excludePaths:
        - /var/www/html/vendor
```
**setnew.php** (11 changes, file deleted)
@@ -1,11 +0,0 @@

```php
<?php
$db = mysqli_connect("localhost", "root", "", "screaming_frog");

$db->query("truncate table crawl");
// $db->query("insert into crawl (start_url, user_id) values ('https://kies-media.de/', 1)");
$db->query("insert into crawl (start_url, user_id) values ('https://kies-media.de/leistungen/externer-ausbilder-fuer-fachinformatiker/', 1)");

$db->query("truncate table urls");
$urls = $db->query("insert ignore into urls (id, url, crawl_id) select 1,start_url, id from crawl where id = 1"); #->fetch_all(MYSQLI_ASSOC)

$db->query("truncate table links");
```
**src/api.php** (189 changes)
```diff
@@ -1,5 +1,13 @@
 <?php
+
+/**
+ * Web Crawler - API Endpoint
+ *
+ * @copyright Copyright (c) 2025 Martin Kiesewetter
+ * @author Martin Kiesewetter <mki@kies-media.de>
+ * @link https://kies-media.de
+ */

 require_once __DIR__ . '/vendor/autoload.php';

 use App\Database;
```
```diff
@@ -73,6 +81,9 @@ try {

         case 'jobs':
             $stmt = $db->query("SELECT * FROM crawl_jobs ORDER BY created_at DESC LIMIT 50");
+            if ($stmt === false) {
+                throw new Exception('Failed to query jobs');
+            }
             $jobs = $stmt->fetchAll();

             echo json_encode([
```
@@ -105,6 +116,148 @@ try {

```php
            ]);
            break;

        case 'broken-links':
            $jobId = $_GET['job_id'] ?? 0;
            $stmt = $db->prepare(
                "SELECT * FROM pages " .
                "WHERE crawl_job_id = ? AND (status_code >= 400 OR status_code = 0) " .
                "ORDER BY status_code DESC, url"
            );
            $stmt->execute([$jobId]);
            $brokenLinks = $stmt->fetchAll();

            echo json_encode([
                'success' => true,
                'broken_links' => $brokenLinks
            ]);
            break;

        case 'seo-analysis':
            $jobId = $_GET['job_id'] ?? 0;
            $stmt = $db->prepare(
                "SELECT id, url, title, meta_description, status_code FROM pages " .
                "WHERE crawl_job_id = ? ORDER BY url"
            );
            $stmt->execute([$jobId]);
            $pages = $stmt->fetchAll();

            $issues = [];
            foreach ($pages as $page) {
                $pageIssues = [];
                $titleLen = mb_strlen($page['title'] ?? '');
                $descLen = mb_strlen($page['meta_description'] ?? '');

                // Title issues (Google: 50-60 chars optimal)
                if (empty($page['title'])) {
                    $pageIssues[] = 'Title missing';
                } elseif ($titleLen < 30) {
                    $pageIssues[] = "Title too short ({$titleLen} chars)";
                } elseif ($titleLen > 60) {
                    $pageIssues[] = "Title too long ({$titleLen} chars)";
                }

                // Meta description issues (Google: 120-160 chars optimal)
                if (empty($page['meta_description'])) {
                    $pageIssues[] = 'Meta description missing';
                } elseif ($descLen < 70) {
                    $pageIssues[] = "Meta description too short ({$descLen} chars)";
                } elseif ($descLen > 160) {
                    $pageIssues[] = "Meta description too long ({$descLen} chars)";
                }

                if (!empty($pageIssues)) {
                    $issues[] = [
                        'url' => $page['url'],
                        'title' => $page['title'],
                        'title_length' => $titleLen,
                        'meta_description' => $page['meta_description'],
                        'meta_length' => $descLen,
                        'issues' => $pageIssues
                    ];
                }
            }

            // Find duplicates
            $titleCounts = [];
            $descCounts = [];
            foreach ($pages as $page) {
                if (!empty($page['title'])) {
                    $titleCounts[$page['title']][] = $page['url'];
                }
                if (!empty($page['meta_description'])) {
                    $descCounts[$page['meta_description']][] = $page['url'];
                }
            }

            $duplicates = [];
            foreach ($titleCounts as $title => $urls) {
                if (count($urls) > 1) {
                    $duplicates[] = [
                        'type' => 'title',
                        'content' => $title,
                        'urls' => $urls
                    ];
                }
            }
            foreach ($descCounts as $desc => $urls) {
                if (count($urls) > 1) {
                    $duplicates[] = [
                        'type' => 'meta_description',
                        'content' => $desc,
                        'urls' => $urls
                    ];
                }
            }

            echo json_encode([
                'success' => true,
                'issues' => $issues,
                'duplicates' => $duplicates,
                'total_pages' => count($pages)
            ]);
            break;

        case 'redirects':
            $jobId = $_GET['job_id'] ?? 0;
            $stmt = $db->prepare(
                "SELECT url, title, status_code, redirect_url, redirect_count FROM pages " .
                "WHERE crawl_job_id = ? AND redirect_count > 0 " .
                "ORDER BY redirect_count DESC, url"
            );
            $stmt->execute([$jobId]);
            $redirects = $stmt->fetchAll();

            // Count redirect types
            $permanent = 0;
            $temporary = 0;
            $excessive = 0;
            $maxThreshold = 3; // From Config::MAX_REDIRECT_THRESHOLD

            foreach ($redirects as $redirect) {
                $code = $redirect['status_code'];
                if ($code == 301 || $code == 308) {
                    $permanent++;
                } elseif ($code == 302 || $code == 303 || $code == 307) {
                    $temporary++;
                }
                if ($redirect['redirect_count'] > $maxThreshold) {
                    $excessive++;
                }
            }

            echo json_encode([
                'success' => true,
                'redirects' => $redirects,
                'stats' => [
                    'total' => count($redirects),
                    'permanent' => $permanent,
                    'temporary' => $temporary,
                    'excessive' => $excessive,
                    'threshold' => $maxThreshold
                ]
            ]);
            break;

        case 'delete':
            $jobId = $_POST['job_id'] ?? 0;
            $stmt = $db->prepare("DELETE FROM crawl_jobs WHERE id = ?");
```
@@ -116,6 +269,42 @@ try {

```php
            ]);
            break;

        case 'recrawl':
            $jobId = $_POST['job_id'] ?? 0;
            $domain = $_POST['domain'] ?? '';

            if (empty($domain)) {
                throw new Exception('Domain is required');
            }

            // Delete all related data for this job
            $stmt = $db->prepare("DELETE FROM crawl_queue WHERE crawl_job_id = ?");
            $stmt->execute([$jobId]);

            $stmt = $db->prepare("DELETE FROM links WHERE crawl_job_id = ?");
            $stmt->execute([$jobId]);

            $stmt = $db->prepare("DELETE FROM pages WHERE crawl_job_id = ?");
            $stmt->execute([$jobId]);

            // Reset job status
            $stmt = $db->prepare(
                "UPDATE crawl_jobs SET status = 'pending', total_pages = 0, total_links = 0, " .
                "started_at = NULL, completed_at = NULL WHERE id = ?"
            );
            $stmt->execute([$jobId]);

            // Start crawling in background
            $cmd = "php " . __DIR__ . "/crawler-worker.php $jobId > /dev/null 2>&1 &";
            exec($cmd);

            echo json_encode([
                'success' => true,
                'job_id' => $jobId,
                'message' => 'Recrawl started'
            ]);
            break;

        default:
            throw new Exception('Invalid action');
    }
```
**src/classes/Config.php** (new file, 29 lines)
@@ -0,0 +1,29 @@

```php
<?php

/**
 * Web Crawler - Configuration Class
 *
 * @copyright Copyright (c) 2025 Martin Kiesewetter
 * @author Martin Kiesewetter <mki@kies-media.de>
 * @link https://kies-media.de
 */

namespace App;

class Config
{
    /**
     * Maximum number of redirects before warning
     */
    public const int MAX_REDIRECT_THRESHOLD = 3;

    /**
     * Maximum crawl depth
     */
    public const int MAX_CRAWL_DEPTH = 50;

    /**
     * Number of parallel requests
     */
    public const int CONCURRENCY = 10;
}
```
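These typed constants (a PHP 8.3 feature) are meant to replace hard-coded values elsewhere; api.php above, for instance, still hard-codes `3` with a comment pointing at `Config::MAX_REDIRECT_THRESHOLD`. A hypothetical caller, shown here only to illustrate the intended usage:

```php
<?php

declare(strict_types=1);

require_once __DIR__ . '/vendor/autoload.php';

use App\Config;

// Illustrative helper (not part of the repository): classify a page's
// redirect chain against the shared threshold instead of a magic number.
function isExcessiveRedirect(int $redirectCount): bool
{
    return $redirectCount > Config::MAX_REDIRECT_THRESHOLD;
}
```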
```diff
@@ -1,5 +1,13 @@
 <?php
+
+/**
+ * Web Crawler - Crawler Class
+ *
+ * @copyright Copyright (c) 2025 Martin Kiesewetter
+ * @author Martin Kiesewetter <mki@kies-media.de>
+ * @link https://kies-media.de
+ */

 namespace App;

 use GuzzleHttp\Client;
@@ -8,28 +16,37 @@ use GuzzleHttp\Psr7\Request;
 use GuzzleHttp\Exception\RequestException;
 use Symfony\Component\DomCrawler\Crawler as DomCrawler;
```
```diff
-class Crawler {
+class Crawler
+{
     private \PDO $db;
     private Client $client;
     private int $concurrency = 10; // Parallel requests
+    /** @var array<string, bool> */
     private array $visited = [];
     private int $crawlJobId;
     private string $baseDomain;

-    public function __construct(int $crawlJobId) {
+    public function __construct(int $crawlJobId)
+    {
         $this->db = Database::getInstance();
         $this->crawlJobId = $crawlJobId;
         $this->client = new Client([
             'timeout' => 30,
             'verify' => false,
             'allow_redirects' => [
                 'max' => 10,
                 'track_redirects' => true
             ],
             'headers' => [
                 'User-Agent' => 'WebCrawler/1.0'
             ]
         ]);
     }

-    public function start(string $startUrl): void {
-        $this->baseDomain = strtolower(parse_url($startUrl, PHP_URL_HOST));
+    public function start(string $startUrl): void
+    {
+        $host = parse_url($startUrl, PHP_URL_HOST);
+        $this->baseDomain = strtolower($host ?: '');

         // Update job status
         $stmt = $this->db->prepare("UPDATE crawl_jobs SET status = 'running', started_at = NOW() WHERE id = ?");
```
```diff
@@ -48,7 +65,8 @@ class Crawler {
         $stmt->execute([$this->crawlJobId]);
     }

-    private function addToQueue(string $url, int $depth): void {
+    private function addToQueue(string $url, int $depth): void
+    {
         if (isset($this->visited[$url])) {
             return;
         }
@@ -63,7 +81,8 @@ class Crawler {
         }
     }

-    private function processQueue(): void {
+    private function processQueue(): void
+    {
         while (true) {
             // Get pending URLs
             $stmt = $this->db->prepare(
@@ -82,14 +101,18 @@ class Crawler {
             }
         }
     }

-    private function crawlBatch(array $urls): void {
-        $requests = function() use ($urls) {
+    /**
+     * @param array<int, array{id: int, url: string, depth: int}> $urls
+     */
+    private function crawlBatch(array $urls): void
+    {
+        $requests = function () use ($urls) {
             foreach ($urls as $item) {
                 // Mark as processing
                 $stmt = $this->db->prepare("UPDATE crawl_queue SET status = 'processing' WHERE id = ?");
                 $stmt->execute([$item['id']]);

-                yield function() use ($item) {
+                yield function () use ($item) {
                     return $this->client->getAsync($item['url']);
                 };
             }
@@ -110,7 +133,12 @@ class Crawler {
         $pool->promise()->wait();
     }
```
```diff
-    private function handleResponse(array $queueItem, $response): void {
+    /**
+     * @param array{id: int, url: string, depth: int} $queueItem
+     * @param \Psr\Http\Message\ResponseInterface $response
+     */
+    private function handleResponse(array $queueItem, $response): void
+    {
         $url = $queueItem['url'];
         $depth = $queueItem['depth'];

@@ -120,30 +148,61 @@ class Crawler {
         $contentType = $response->getHeaderLine('Content-Type');
         $body = $response->getBody()->getContents();

+        // Track redirects
+        $redirectUrl = null;
+        $redirectCount = 0;
+        if ($response->hasHeader('X-Guzzle-Redirect-History')) {
+            $redirectHistory = $response->getHeader('X-Guzzle-Redirect-History');
+            $redirectCount = count($redirectHistory);
+            if ($redirectCount > 0) {
+                $redirectUrl = end($redirectHistory);
+            }
+        }
+
         // Save page
         $domCrawler = new DomCrawler($body, $url);
         $title = $domCrawler->filter('title')->count() > 0
             ? $domCrawler->filter('title')->text()
             : '';

+        $metaDescription = $domCrawler->filter('meta[name="description"]')->count() > 0
+            ? $domCrawler->filter('meta[name="description"]')->attr('content')
+            : '';
+
         $stmt = $this->db->prepare(
-            "INSERT INTO pages (crawl_job_id, url, title, status_code, content_type)
-             VALUES (?, ?, ?, ?, ?)
-             ON DUPLICATE KEY UPDATE id=LAST_INSERT_ID(id), status_code = VALUES(status_code)"
+            "INSERT INTO pages (crawl_job_id, url, title, meta_description, status_code, " .
+            "content_type, redirect_url, redirect_count) " .
+            "VALUES (?, ?, ?, ?, ?, ?, ?, ?) " .
+            "ON DUPLICATE KEY UPDATE id=LAST_INSERT_ID(id), status_code = VALUES(status_code), " .
+            "meta_description = VALUES(meta_description), redirect_url = VALUES(redirect_url), " .
+            "redirect_count = VALUES(redirect_count)"
         );

-        $stmt->execute([$this->crawlJobId, $url, $title, $statusCode, $contentType]);
+        $stmt->execute([
+            $this->crawlJobId,
+            $url,
+            $title,
+            $metaDescription,
+            $statusCode,
+            $contentType,
+            $redirectUrl,
+            $redirectCount
+        ]);
         $pageId = $this->db->lastInsertId();

         // If pageId is 0, fetch it manually
-        if ($pageId == 0) {
+        if ($pageId == 0 || $pageId === '0') {
             $stmt = $this->db->prepare("SELECT id FROM pages WHERE crawl_job_id = ? AND url = ?");
             $stmt->execute([$this->crawlJobId, $url]);
-            $pageId = $stmt->fetchColumn();
+            $fetchedId = $stmt->fetchColumn();
+            $pageId = is_numeric($fetchedId) ? (int)$fetchedId : 0;
         }

+        // Ensure pageId is an integer
+        $pageId = is_numeric($pageId) ? (int)$pageId : 0;
+
         // Extract and save links
-        if (str_contains($contentType, 'text/html')) {
+        if (str_contains($contentType, 'text/html') && $pageId > 0) {
             echo "Extracting links from: $url (pageId: $pageId)\n";
             $this->extractLinks($domCrawler, $url, $pageId, $depth);
         } else {
```
```diff
@@ -155,7 +214,8 @@ class Crawler {
         $stmt->execute([$queueItem['id']]);
     }

-    private function extractLinks(DomCrawler $crawler, string $sourceUrl, int $pageId, int $depth): void {
+    private function extractLinks(DomCrawler $crawler, string $sourceUrl, int $pageId, int $depth): void
+    {
         $linkCount = 0;
         $crawler->filter('a')->each(function (DomCrawler $node) use ($sourceUrl, $pageId, $depth, &$linkCount) {
             try {
@@ -176,13 +236,14 @@ class Crawler {
             $isNofollow = str_contains($rel, 'nofollow');

             // Check if internal (same domain, no subdomains)
-            $targetDomain = strtolower(parse_url($targetUrl, PHP_URL_HOST) ?? '');
+            $targetHost = parse_url($targetUrl, PHP_URL_HOST);
+            $targetDomain = strtolower($targetHost ?: '');
             $isInternal = ($targetDomain === $this->baseDomain);

             // Save link
             $stmt = $this->db->prepare(
-                "INSERT INTO links (page_id, crawl_job_id, source_url, target_url, link_text, is_nofollow, is_internal)
-                 VALUES (?, ?, ?, ?, ?, ?, ?)"
+                "INSERT INTO links (page_id, crawl_job_id, source_url, target_url, " .
+                "link_text, is_nofollow, is_internal) VALUES (?, ?, ?, ?, ?, ?, ?)"
             );
             $stmt->execute([
                 $pageId,
@@ -207,7 +268,8 @@ class Crawler {
             echo "Processed $linkCount links from $sourceUrl\n";
         }

-    private function makeAbsoluteUrl(string $url, string $base): string {
+    private function makeAbsoluteUrl(string $url, string $base): string
+    {
         if (filter_var($url, FILTER_VALIDATE_URL)) {
             return $url;
         }
@@ -225,14 +287,20 @@ class Crawler {
         return "$scheme://$host$basePath$url";
     }

-    private function handleError(array $queueItem, $reason): void {
+    /**
+     * @param array{id: int, url: string, depth: int} $queueItem
+     * @param \GuzzleHttp\Exception\RequestException $reason
+     */
+    private function handleError(array $queueItem, $reason): void
+    {
         $stmt = $this->db->prepare(
             "UPDATE crawl_queue SET status = 'failed', processed_at = NOW(), retry_count = retry_count + 1 WHERE id = ?"
         );
         $stmt->execute([$queueItem['id']]);
     }

-    private function updateJobStats(): void {
+    private function updateJobStats(): void
+    {
         $stmt = $this->db->prepare(
             "UPDATE crawl_jobs SET
                 total_pages = (SELECT COUNT(*) FROM pages WHERE crawl_job_id = ?),
@@ -242,7 +310,8 @@ class Crawler {
         $stmt->execute([$this->crawlJobId, $this->crawlJobId, $this->crawlJobId]);
     }

-    private function normalizeUrl(string $url): string {
+    private function normalizeUrl(string $url): string
+    {
         // Parse URL
         $parts = parse_url($url);
```
```diff
@@ -1,16 +1,28 @@
 <?php
+
+/**
+ * Web Crawler - Database Class
+ *
+ * @copyright Copyright (c) 2025 Martin Kiesewetter
+ * @author Martin Kiesewetter <mki@kies-media.de>
+ * @link https://kies-media.de
+ */

 namespace App;

 use PDO;
 use PDOException;

-class Database {
+class Database
+{
     private static ?PDO $instance = null;

-    private function __construct() {}
+    private function __construct()
+    {
+    }

-    public static function getInstance(): PDO {
+    public static function getInstance(): PDO
+    {
         if (self::$instance === null) {
             try {
                 self::$instance = new PDO(
```
@@ -1,4 +1,5 @@
{
    "_comment": "Web Crawler - Composer Configuration | Copyright (c) 2025 Martin Kiesewetter <mki@kies-media.de> | https://kies-media.de",
    "name": "web-crawler/app",
    "description": "Web Crawler Application with Parallel Processing",
    "type": "project",
@@ -9,7 +10,9 @@
        "symfony/css-selector": "^7.0"
    },
    "require-dev": {
        "phpunit/phpunit": "^11.0"
        "phpunit/phpunit": "^11.0",
        "phpstan/phpstan": "^2.1",
        "squizlabs/php_codesniffer": "^4.0"
    },
    "autoload": {
        "psr-4": {
@@ -22,6 +25,9 @@
        }
    },
    "scripts": {
        "test": "phpunit"
        "test": "phpunit",
        "phpstan": "phpstan analyse -c ../phpstan.neon --memory-limit=512M",
        "phpcs": "phpcs --standard=PSR12 --ignore=/var/www/html/vendor /var/www/html /var/www/tests",
        "phpcbf": "phpcbf --standard=PSR12 --ignore=/var/www/html/vendor /var/www/html /var/www/tests"
    }
}

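The new `scripts` entries above are invoked through Composer inside the PHP container. A minimal sketch, assuming the `php` service name from the docker-compose commands shown earlier and the container paths hard-coded in the `phpcs`/`phpcbf` scripts:

```shell
# Run the test suite ("test" maps to phpunit)
docker-compose exec php composer test

# Static analysis with the new PHPStan script
docker-compose exec php composer phpstan

# Check PSR-12 compliance, then auto-fix fixable violations
docker-compose exec php composer phpcs
docker-compose exec php composer phpcbf
```

`phpcbf` only rewrites violations the sniffer can fix mechanically (braces, spacing); anything it leaves behind shows up again in the next `phpcs` run.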
134 src/composer.lock (generated)
@@ -4,7 +4,7 @@
        "Read more about it at https://getcomposer.org/doc/01-basic-usage.md#installing-dependencies",
        "This file is @generated automatically"
    ],
    "content-hash": "96376d6cdbd0e0665e091abe3e0ef8d8",
    "content-hash": "bb0d5fc291c18a44bfc693b94b302357",
    "packages": [
        {
            "name": "guzzlehttp/guzzle",
@@ -1211,6 +1211,59 @@
            },
            "time": "2022-02-21T01:04:05+00:00"
        },
        {
            "name": "phpstan/phpstan",
            "version": "2.1.30",
            "dist": {
                "type": "zip",
                "url": "https://api.github.com/repos/phpstan/phpstan/zipball/a4a7f159927983dd4f7c8020ed227d80b7f39d7d",
                "reference": "a4a7f159927983dd4f7c8020ed227d80b7f39d7d",
                "shasum": ""
            },
            "require": {
                "php": "^7.4|^8.0"
            },
            "conflict": {
                "phpstan/phpstan-shim": "*"
            },
            "bin": [
                "phpstan",
                "phpstan.phar"
            ],
            "type": "library",
            "autoload": {
                "files": [
                    "bootstrap.php"
                ]
            },
            "notification-url": "https://packagist.org/downloads/",
            "license": [
                "MIT"
            ],
            "description": "PHPStan - PHP Static Analysis Tool",
            "keywords": [
                "dev",
                "static analysis"
            ],
            "support": {
                "docs": "https://phpstan.org/user-guide/getting-started",
                "forum": "https://github.com/phpstan/phpstan/discussions",
                "issues": "https://github.com/phpstan/phpstan/issues",
                "security": "https://github.com/phpstan/phpstan/security/policy",
                "source": "https://github.com/phpstan/phpstan-src"
            },
            "funding": [
                {
                    "url": "https://github.com/ondrejmirtes",
                    "type": "github"
                },
                {
                    "url": "https://github.com/phpstan",
                    "type": "github"
                }
            ],
            "time": "2025-10-02T16:07:52+00:00"
        },
        {
            "name": "phpunit/php-code-coverage",
            "version": "11.0.11",
@@ -2641,6 +2694,85 @@
            ],
            "time": "2024-10-09T05:16:32+00:00"
        },
        {
            "name": "squizlabs/php_codesniffer",
            "version": "4.0.0",
            "source": {
                "type": "git",
                "url": "https://github.com/PHPCSStandards/PHP_CodeSniffer.git",
                "reference": "06113cfdaf117fc2165f9cd040bd0f17fcd5242d"
            },
            "dist": {
                "type": "zip",
                "url": "https://api.github.com/repos/PHPCSStandards/PHP_CodeSniffer/zipball/06113cfdaf117fc2165f9cd040bd0f17fcd5242d",
                "reference": "06113cfdaf117fc2165f9cd040bd0f17fcd5242d",
                "shasum": ""
            },
            "require": {
                "ext-simplexml": "*",
                "ext-tokenizer": "*",
                "ext-xmlwriter": "*",
                "php": ">=7.2.0"
            },
            "require-dev": {
                "phpunit/phpunit": "^8.4.0 || ^9.3.4 || ^10.5.32 || 11.3.3 - 11.5.28 || ^11.5.31"
            },
            "bin": [
                "bin/phpcbf",
                "bin/phpcs"
            ],
            "type": "library",
            "notification-url": "https://packagist.org/downloads/",
            "license": [
                "BSD-3-Clause"
            ],
            "authors": [
                {
                    "name": "Greg Sherwood",
                    "role": "Former lead"
                },
                {
                    "name": "Juliette Reinders Folmer",
                    "role": "Current lead"
                },
                {
                    "name": "Contributors",
                    "homepage": "https://github.com/PHPCSStandards/PHP_CodeSniffer/graphs/contributors"
                }
            ],
            "description": "PHP_CodeSniffer tokenizes PHP files and detects violations of a defined set of coding standards.",
            "homepage": "https://github.com/PHPCSStandards/PHP_CodeSniffer",
            "keywords": [
                "phpcs",
                "standards",
                "static analysis"
            ],
            "support": {
                "issues": "https://github.com/PHPCSStandards/PHP_CodeSniffer/issues",
                "security": "https://github.com/PHPCSStandards/PHP_CodeSniffer/security/policy",
                "source": "https://github.com/PHPCSStandards/PHP_CodeSniffer",
                "wiki": "https://github.com/PHPCSStandards/PHP_CodeSniffer/wiki"
            },
            "funding": [
                {
                    "url": "https://github.com/PHPCSStandards",
                    "type": "github"
                },
                {
                    "url": "https://github.com/jrfnl",
                    "type": "github"
                },
                {
                    "url": "https://opencollective.com/php_codesniffer",
                    "type": "open_collective"
                },
                {
                    "url": "https://thanks.dev/u/gh/phpcsstandards",
                    "type": "thanks_dev"
                }
            ],
            "time": "2025-09-15T11:28:58+00:00"
        },
        {
            "name": "staabm/side-effects-detector",
            "version": "1.0.5",

@@ -1,6 +1,14 @@
#!/usr/bin/env php
<?php

/**
 * Web Crawler - Background Worker
 *
 * @copyright Copyright (c) 2025 Martin Kiesewetter
 * @author Martin Kiesewetter <mki@kies-media.de>
 * @link https://kies-media.de
 */

require_once __DIR__ . '/vendor/autoload.php';

use App\Database;

412 src/index.php
@@ -1,9 +1,28 @@
<!DOCTYPE html>
<!--
/**
 * Web Crawler - Main Interface
 *
 * @copyright Copyright (c) 2025 Martin Kiesewetter
 * @author Martin Kiesewetter <mki@kies-media.de>
 * @link https://kies-media.de
 */
-->
<html lang="de">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Web Crawler</title>

    <!-- jQuery -->
    <script src="https://code.jquery.com/jquery-3.7.1.min.js"></script>

    <!-- DataTables CSS -->
    <link rel="stylesheet" href="https://cdn.datatables.net/1.13.7/css/jquery.dataTables.min.css">

    <!-- DataTables JS -->
    <script src="https://cdn.datatables.net/1.13.7/js/jquery.dataTables.min.js"></script>

    <style>
        * {
            margin: 0;
@@ -198,6 +217,58 @@
            text-overflow: ellipsis;
            white-space: nowrap;
        }

        /* DataTables Styling */
        .dataTables_wrapper {
            padding: 20px 0;
        }

        .dataTables_filter input {
            padding: 8px;
            border: 2px solid #e0e0e0;
            border-radius: 6px;
            margin-left: 10px;
        }

        .dataTables_length select {
            padding: 6px;
            border: 2px solid #e0e0e0;
            border-radius: 6px;
            margin: 0 10px;
        }

        .dataTables_info {
            padding-top: 10px;
            color: #7f8c8d;
        }

        .dataTables_paginate {
            padding-top: 10px;
        }

        .dataTables_paginate .paginate_button {
            padding: 6px 12px;
            margin: 0 2px;
            border: 1px solid #e0e0e0;
            border-radius: 4px;
            background: white;
            cursor: pointer;
        }

        .dataTables_paginate .paginate_button.current {
            background: #3498db;
            color: white !important;
            border-color: #3498db;
        }

        .dataTables_paginate .paginate_button:hover {
            background: #ecf0f1;
        }

        .dataTables_paginate .paginate_button.disabled {
            cursor: not-allowed;
            opacity: 0.5;
        }
    </style>
</head>
<body>
@@ -214,7 +285,7 @@

        <div class="card">
            <h2>Crawl Jobs</h2>
            <table id="jobsTable">
            <table id="jobsTable" class="display">
                <thead>
                    <tr>
                        <th>ID</th>
@@ -241,10 +312,13 @@
            <div class="tabs">
                <button class="tab active" onclick="switchTab('pages')">Seiten</button>
                <button class="tab" onclick="switchTab('links')">Links</button>
                <button class="tab" onclick="switchTab('broken')">Broken Links</button>
                <button class="tab" onclick="switchTab('redirects')">Redirects</button>
                <button class="tab" onclick="switchTab('seo')">SEO Analysis</button>
            </div>

            <div class="tab-content active" id="pages-tab">
                <table>
                <table id="pagesTable" class="display">
                    <thead>
                        <tr>
                            <th>URL</th>
@@ -260,7 +334,7 @@
            </div>

            <div class="tab-content" id="links-tab">
                <table>
                <table id="linksTable" class="display">
                    <thead>
                        <tr>
                            <th>Von</th>
@@ -275,6 +349,62 @@
                    </tbody>
                </table>
            </div>

            <div class="tab-content" id="broken-tab">
                <table id="brokenTable" class="display">
                    <thead>
                        <tr>
                            <th>URL</th>
                            <th>Status Code</th>
                            <th>Titel</th>
                            <th>Gecrawlt</th>
                        </tr>
                    </thead>
                    <tbody id="brokenBody">
                        <tr><td colspan="4" class="loading">Keine defekten Links gefunden</td></tr>
                    </tbody>
                </table>
            </div>

            <div class="tab-content" id="redirects-tab">
                <h3>Redirect Statistics</h3>
                <div id="redirectStats" class="stats" style="margin-bottom: 20px;"></div>
                <table id="redirectsTable" class="display">
                    <thead>
                        <tr>
                            <th>URL</th>
                            <th>Redirect To</th>
                            <th>Status Code</th>
                            <th>Redirect Count</th>
                            <th>Type</th>
                        </tr>
                    </thead>
                    <tbody id="redirectsBody">
                        <tr><td colspan="5" class="loading">Keine Redirects gefunden</td></tr>
                    </tbody>
                </table>
            </div>

            <div class="tab-content" id="seo-tab">
                <h3>SEO Issues</h3>
                <div id="seoStats" style="margin-bottom: 20px;"></div>
                <table id="seoTable" class="display">
                    <thead>
                        <tr>
                            <th>URL</th>
                            <th>Title (Länge)</th>
                            <th>Meta Description (Länge)</th>
                            <th>Issues</th>
                        </tr>
                    </thead>
                    <tbody id="seoIssuesBody">
                        <tr><td colspan="4" class="loading">Keine SEO-Probleme gefunden</td></tr>
                    </tbody>
                </table>

                <h3 style="margin-top: 30px;">Duplicate Content</h3>
                <div id="seoDuplicatesBody"></div>
            </div>
        </div>
    </div>
</div>
@@ -312,12 +442,19 @@
            }
        }

        let jobsDataTable = null;

        async function loadJobs() {
            try {
                const response = await fetch('/api.php?action=jobs');
                const data = await response.json();

                if (data.success) {
                    // Destroy existing DataTable if it exists
                    if (jobsDataTable) {
                        jobsDataTable.destroy();
                    }

                    const tbody = document.getElementById('jobsBody');
                    tbody.innerHTML = data.jobs.map(job => `
                        <tr>
@@ -329,10 +466,30 @@
                            <td>${job.started_at || '-'}</td>
                            <td>
                                <button class="action-btn" onclick="viewJob(${job.id})">Ansehen</button>
                                <button class="action-btn" onclick="recrawlJob(${job.id}, '${job.domain}')">Recrawl</button>
                                <button class="action-btn" onclick="deleteJob(${job.id})">Löschen</button>
                            </td>
                        </tr>
                    `).join('');

                    // Initialize DataTable
                    jobsDataTable = $('#jobsTable').DataTable({
                        pageLength: 25,
                        order: [[0, 'desc']],
                        language: {
                            search: 'Suchen:',
                            lengthMenu: 'Zeige _MENU_ Einträge',
                            info: 'Zeige _START_ bis _END_ von _TOTAL_ Einträgen',
                            infoEmpty: 'Keine Einträge verfügbar',
                            infoFiltered: '(gefiltert von _MAX_ Einträgen)',
                            paginate: {
                                first: 'Erste',
                                last: 'Letzte',
                                next: 'Nächste',
                                previous: 'Vorherige'
                            }
                        }
                    });
                }
            } catch (e) {
                console.error('Fehler beim Laden der Jobs:', e);
@@ -404,6 +561,10 @@
            const pagesResponse = await fetch(`/api.php?action=pages&job_id=${currentJobId}`);
            const pagesData = await pagesResponse.json();

            if ($.fn.DataTable.isDataTable('#pagesTable')) {
                $('#pagesTable').DataTable().destroy();
            }

            if (pagesData.success && pagesData.pages.length > 0) {
                document.getElementById('pagesBody').innerHTML = pagesData.pages.map(page => `
                    <tr>
@@ -413,12 +574,33 @@
                        <td>${page.crawled_at}</td>
                    </tr>
                `).join('');

                $('#pagesTable').DataTable({
                    pageLength: 50,
                    language: {
                        search: 'Suchen:',
                        lengthMenu: 'Zeige _MENU_ Einträge',
                        info: 'Zeige _START_ bis _END_ von _TOTAL_ Einträgen',
                        infoEmpty: 'Keine Einträge verfügbar',
                        infoFiltered: '(gefiltert von _MAX_ Einträgen)',
                        paginate: {
                            first: 'Erste',
                            last: 'Letzte',
                            next: 'Nächste',
                            previous: 'Vorherige'
                        }
                    }
                });
            }

            // Load links
            const linksResponse = await fetch(`/api.php?action=links&job_id=${currentJobId}`);
            const linksData = await linksResponse.json();

            if ($.fn.DataTable.isDataTable('#linksTable')) {
                $('#linksTable').DataTable().destroy();
            }

            if (linksData.success && linksData.links.length > 0) {
                document.getElementById('linksBody').innerHTML = linksData.links.map(link => `
                    <tr>
@@ -429,6 +611,205 @@
                        <td>${link.is_internal ? 'Intern' : '<span class="external">Extern</span>'}</td>
                    </tr>
                `).join('');

                $('#linksTable').DataTable({
                    pageLength: 50,
                    language: {
                        search: 'Suchen:',
                        lengthMenu: 'Zeige _MENU_ Einträge',
                        info: 'Zeige _START_ bis _END_ von _TOTAL_ Einträgen',
                        infoEmpty: 'Keine Einträge verfügbar',
                        infoFiltered: '(gefiltert von _MAX_ Einträgen)',
                        paginate: {
                            first: 'Erste',
                            last: 'Letzte',
                            next: 'Nächste',
                            previous: 'Vorherige'
                        }
                    }
                });
            }

            // Load broken links
            const brokenResponse = await fetch(`/api.php?action=broken-links&job_id=${currentJobId}`);
            const brokenData = await brokenResponse.json();

            if ($.fn.DataTable.isDataTable('#brokenTable')) {
                $('#brokenTable').DataTable().destroy();
            }

            if (brokenData.success && brokenData.broken_links.length > 0) {
                document.getElementById('brokenBody').innerHTML = brokenData.broken_links.map(page => `
                    <tr>
                        <td class="url-cell" title="${page.url}">${page.url}</td>
                        <td><span class="status failed">${page.status_code || 'Error'}</span></td>
                        <td>${page.title || '-'}</td>
                        <td>${page.crawled_at}</td>
                    </tr>
                `).join('');

                $('#brokenTable').DataTable({
                    pageLength: 25,
                    language: {
                        search: 'Suchen:',
                        lengthMenu: 'Zeige _MENU_ Einträge',
                        info: 'Zeige _START_ bis _END_ von _TOTAL_ Einträgen',
                        infoEmpty: 'Keine Einträge verfügbar',
                        infoFiltered: '(gefiltert von _MAX_ Einträgen)',
                        paginate: {
                            first: 'Erste',
                            last: 'Letzte',
                            next: 'Nächste',
                            previous: 'Vorherige'
                        }
                    }
                });
            } else {
                document.getElementById('brokenBody').innerHTML = '<tr><td colspan="4" class="loading">Keine defekten Links gefunden</td></tr>';
            }

            // Load SEO analysis
            const seoResponse = await fetch(`/api.php?action=seo-analysis&job_id=${currentJobId}`);
            const seoData = await seoResponse.json();

            if (seoData.success) {
                // SEO Stats
                document.getElementById('seoStats').innerHTML = `
                    <div class="stat-box">
                        <div class="stat-label">Total Pages</div>
                        <div class="stat-value">${seoData.total_pages}</div>
                    </div>
                    <div class="stat-box">
                        <div class="stat-label">Pages with Issues</div>
                        <div class="stat-value">${seoData.issues.length}</div>
                    </div>
                    <div class="stat-box">
                        <div class="stat-label">Duplicates Found</div>
                        <div class="stat-value">${seoData.duplicates.length}</div>
                    </div>
                `;

                // SEO Issues
                if ($.fn.DataTable.isDataTable('#seoTable')) {
                    $('#seoTable').DataTable().destroy();
                }

                if (seoData.issues.length > 0) {
                    document.getElementById('seoIssuesBody').innerHTML = seoData.issues.map(item => `
                        <tr>
                            <td class="url-cell" title="${item.url}">${item.url}</td>
                            <td>${item.title || '-'} (${item.title_length})</td>
                            <td>${item.meta_description ? item.meta_description.substring(0, 50) + '...' : '-'} (${item.meta_length})</td>
                            <td><span class="nofollow">${item.issues.join(', ')}</span></td>
                        </tr>
                    `).join('');

                    $('#seoTable').DataTable({
                        pageLength: 25,
                        language: {
                            search: 'Suchen:',
                            lengthMenu: 'Zeige _MENU_ Einträge',
                            info: 'Zeige _START_ bis _END_ von _TOTAL_ Einträgen',
                            infoEmpty: 'Keine Einträge verfügbar',
                            infoFiltered: '(gefiltert von _MAX_ Einträgen)',
                            paginate: {
                                first: 'Erste',
                                last: 'Letzte',
                                next: 'Nächste',
                                previous: 'Vorherige'
                            }
                        }
                    });
                } else {
                    document.getElementById('seoIssuesBody').innerHTML = '<tr><td colspan="4" class="loading">Keine SEO-Probleme gefunden</td></tr>';
                }

                // Duplicates
                if (seoData.duplicates.length > 0) {
                    document.getElementById('seoDuplicatesBody').innerHTML = seoData.duplicates.map(dup => `
                        <div class="stat-box" style="margin-bottom: 15px;">
                            <div class="stat-label">Duplicate ${dup.type}</div>
                            <div style="font-size: 14px; margin: 10px 0;"><strong>${dup.content}</strong></div>
                            <div style="font-size: 12px;">Found on ${dup.urls.length} pages:</div>
                            <ul style="margin-top: 5px; font-size: 12px;">
                                ${dup.urls.map(url => `<li>${url}</li>`).join('')}
                            </ul>
                        </div>
                    `).join('');
                } else {
                    document.getElementById('seoDuplicatesBody').innerHTML = '<p>Keine doppelten Inhalte gefunden</p>';
                }
            }

            // Load redirects
            const redirectsResponse = await fetch(`/api.php?action=redirects&job_id=${currentJobId}`);
            const redirectsData = await redirectsResponse.json();

            if (redirectsData.success) {
                const stats = redirectsData.stats;

                // Redirect Stats
                document.getElementById('redirectStats').innerHTML = `
                    <div class="stat-box">
                        <div class="stat-label">Total Redirects</div>
                        <div class="stat-value">${stats.total}</div>
                    </div>
                    <div class="stat-box">
                        <div class="stat-label">Permanent (301/308)</div>
                        <div class="stat-value">${stats.permanent}</div>
                    </div>
                    <div class="stat-box">
                        <div class="stat-label">Temporary (302/303/307)</div>
                        <div class="stat-value">${stats.temporary}</div>
                    </div>
                    <div class="stat-box">
                        <div class="stat-label">Excessive (>${stats.threshold})</div>
                        <div class="stat-value" style="color: ${stats.excessive > 0 ? '#e74c3c' : '#27ae60'}">${stats.excessive}</div>
                        <div class="stat-sublabel">threshold: ${stats.threshold}</div>
                    </div>
                `;

                // Redirect Table
                if ($.fn.DataTable.isDataTable('#redirectsTable')) {
                    $('#redirectsTable').DataTable().destroy();
                }

                if (redirectsData.redirects.length > 0) {
                    document.getElementById('redirectsBody').innerHTML = redirectsData.redirects.map(redirect => {
                        const isExcessive = redirect.redirect_count > stats.threshold;
                        const isPermRedirect = redirect.status_code == 301 || redirect.status_code == 308;
                        const redirectType = isPermRedirect ? 'Permanent' : 'Temporary';

                        return `
                            <tr style="${isExcessive ? 'background-color: #fff3cd;' : ''}">
                                <td class="url-cell" title="${redirect.url}">${redirect.url}</td>
                                <td class="url-cell" title="${redirect.redirect_url || '-'}">${redirect.redirect_url || '-'}</td>
                                <td><span class="status ${isPermRedirect ? 'completed' : 'running'}">${redirect.status_code}</span></td>
                                <td><strong ${isExcessive ? 'style="color: #e74c3c;"' : ''}>${redirect.redirect_count}</strong></td>
                                <td>${redirectType}</td>
                            </tr>
                        `;
                    }).join('');

                    $('#redirectsTable').DataTable({
                        pageLength: 25,
                        language: {
                            search: 'Suchen:',
                            lengthMenu: 'Zeige _MENU_ Einträge',
                            info: 'Zeige _START_ bis _END_ von _TOTAL_ Einträgen',
                            infoEmpty: 'Keine Einträge verfügbar',
                            infoFiltered: '(gefiltert von _MAX_ Einträgen)',
                            paginate: {
                                first: 'Erste',
                                last: 'Letzte',
                                next: 'Nächste',
                                previous: 'Vorherige'
                            }
                        }
                    });
                } else {
                    document.getElementById('redirectsBody').innerHTML = '<tr><td colspan="5" class="loading">Keine Redirects gefunden</td></tr>';
                }
            }

        // Update jobs table
@@ -463,6 +844,31 @@
            }
        }

        async function recrawlJob(jobId, domain) {
            if (!confirm('Job-Ergebnisse löschen und neu crawlen?')) return;

            const formData = new FormData();
            formData.append('job_id', jobId);
            formData.append('domain', domain);

            try {
                const response = await fetch('/api.php?action=recrawl', {
                    method: 'POST',
                    body: formData
                });
                const data = await response.json();

                if (data.success) {
                    loadJobs();
                    alert('Recrawl gestartet! Job ID: ' + data.job_id);
                } else {
                    alert('Fehler: ' + data.error);
                }
            } catch (e) {
                alert('Fehler beim Recrawl: ' + e.message);
            }
        }

        function switchTab(tab) {
            document.querySelectorAll('.tab').forEach(t => t.classList.remove('active'));
            document.querySelectorAll('.tab-content').forEach(c => c.classList.remove('active'));

@@ -18,7 +18,8 @@ class CrawlerIntegrationTest extends TestCase
        // Create a test job
        $stmt = $this->db->prepare("INSERT INTO crawl_jobs (domain, status) VALUES (?, 'pending')");
        $stmt->execute(['https://httpbin.org']);
        $this->testJobId = $this->db->lastInsertId();
        $lastId = $this->db->lastInsertId();
        $this->testJobId = is_numeric($lastId) ? (int)$lastId : 0;
    }

    protected function tearDown(): void

@@ -17,7 +17,8 @@ class CrawlerTest extends TestCase
        // Create a test job
        $stmt = $db->prepare("INSERT INTO crawl_jobs (domain, status) VALUES (?, 'pending')");
        $stmt->execute(['https://example.com']);
        $this->testJobId = $db->lastInsertId();
        $lastId = $db->lastInsertId();
        $this->testJobId = is_numeric($lastId) ? (int)$lastId : 0;
    }

    protected function tearDown(): void

@@ -42,6 +42,7 @@ class DatabaseTest extends TestCase
    {
        $db = Database::getInstance();
        $stmt = $db->query('SELECT 1 as test');
        $this->assertNotFalse($stmt, 'Query failed');
        $result = $stmt->fetch();

        $this->assertEquals(['test' => 1], $result);

530 webanalyse.php
@@ -1,530 +0,0 @@
|
||||
<?php
|
||||
|
||||
declare(strict_types=1);
|
||||
|
||||
/**
|
||||
* Koordiniert Webseiten-Crawls und persistiert Antwortdaten in der Screaming Frog Datenbank.
|
||||
*/
|
||||
class WebAnalyse
|
||||
{
|
||||
private const USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36';
|
||||
private const CURL_TIMEOUT = 30;
|
||||
|
||||
/**
|
||||
* @var mysqli Verbindung zur Screaming Frog Datenbank.
|
||||
*/
|
||||
private mysqli $db;
|
||||
|
||||
public function __construct(?mysqli $connection = null)
|
||||
{
|
||||
$connection ??= mysqli_connect('localhost', 'root', '', 'screaming_frog');
|
||||
|
||||
if (!$connection instanceof mysqli) {
|
||||
throw new RuntimeException('Verbindung zur Datenbank konnte nicht hergestellt werden: ' . mysqli_connect_error());
|
||||
}
|
||||
|
||||
$connection->set_charset('utf8mb4');
|
||||
$this->db = $connection;
|
||||
}
|
||||
|
||||
/**
|
||||
* Holt eine einzelne URL und gibt Response-Metadaten zurueck.
|
||||
*
|
||||
* @param string $url Zieladresse fuer den Abruf.
|
||||
* @return array<string,mixed> Antwortdaten oder ein "error"-Schluessel.
|
||||
*/
|
||||
public function getWebsite(string $url): array
|
||||
{
|
||||
$handle = $this->createCurlHandle($url);
|
||||
$response = curl_exec($handle);
|
||||
|
||||
if ($response === false) {
|
||||
$error = curl_error($handle);
|
||||
curl_close($handle);
|
||||
return ['error' => $error];
|
||||
}
|
||||
|
||||
$info = curl_getinfo($handle);
|
||||
curl_close($handle);
|
||||
|
||||
return $this->buildResponsePayload($response, $info);
|
||||
}
|
||||
|
||||
/**
|
||||
* Ruft mehrere URLs parallel via curl_multi ab.
|
||||
*
|
||||
* @param array<int,string> $urls Liste von Ziel-URLs.
|
||||
* @return array<string,array<string,mixed>> Antworten je URL.
|
||||
*/
|
||||
public function getMultipleWebsites(array $urls): array
|
||||
{
|
||||
if ($urls === []) {
|
||||
return [];
|
||||
}
|
||||
|
||||
$results = [];
|
||||
$multiHandle = curl_multi_init();
|
||||
$handles = [];
|
||||
|
||||
foreach ($urls as $url) {
|
||||
$handle = $this->createCurlHandle($url);
|
||||
$handles[$url] = $handle;
|
||||
curl_multi_add_handle($multiHandle, $handle);
|
||||
}
|
||||
|
||||
$running = null;
|
||||
do {
|
||||
$status = curl_multi_exec($multiHandle, $running);
|
||||
} while ($status === CURLM_CALL_MULTI_PERFORM);
|
||||
|
||||
while ($running && $status === CURLM_OK) {
|
||||
if (curl_multi_select($multiHandle, 1.0) === -1) {
|
||||
usleep(100000);
|
||||
}
|
||||
|
||||
do {
|
||||
$status = curl_multi_exec($multiHandle, $running);
|
||||
} while ($status === CURLM_CALL_MULTI_PERFORM);
|
||||
}
|
||||
|
||||
foreach ($handles as $url => $handle) {
|
||||
$response = curl_multi_getcontent($handle);
|
||||
|
||||
if ($response === false) {
|
||||
$results[$url] = ['error' => curl_error($handle)];
|
||||
} else {
|
||||
$results[$url] = $this->buildResponsePayload($response, curl_getinfo($handle));
|
||||
}
|
||||
|
||||
curl_multi_remove_handle($multiHandle, $handle);
|
||||
curl_close($handle);
|
||||
}
|
||||
|
||||
curl_multi_close($multiHandle);
|
||||
|
||||
return $results;
|
||||
}
|
||||
|
||||
/**
|
||||
* Persistiert Response-Daten und stoesst die Analyse der gefundenen Links an.
|
||||
*
|
||||
* @param int $crawlID Identifier der Crawl-Session.
|
||||
* @param string $url Ursprung-URL, deren Antwort verarbeitet wird.
|
||||
* @param array<string,mixed> $data Ergebnis der HTTP-Abfrage.
|
||||
*/
|
||||
public function processResults(int $crawlID, string $url, array $data): void
|
||||
{
|
||||
if (isset($data['error'])) {
|
||||
error_log(sprintf('Fehler bei der Analyse von %s: %s', $url, $data['error']));
|
||||
return;
|
||||
}
|
||||
|
||||
$body = (string)($data['body'] ?? '');
|
||||
|
||||
$update = $this->db->prepare(
|
||||
'UPDATE urls
|
||||
SET status_code = ?, response_time = ?, body_size = ?, date = NOW(), body = ?
|
||||
WHERE url = ? AND crawl_id = ?
|
||||
LIMIT 1'
|
||||
);
|
||||
|
||||
if ($update === false) {
|
||||
throw new RuntimeException('Update-Statement konnte nicht vorbereitet werden: ' . $this->db->error);
|
||||
}
|
||||
|
||||
$statusCode = (int)($data['status_code'] ?? 0);
|
||||
$responseTimeMs = (int)round(((float)($data['response_time'] ?? 0)) * 1000);
|
||||
$bodySize = (int)($data['body_size'] ?? strlen($body));
|
||||
|
||||
$update->bind_param('iiissi', $statusCode, $responseTimeMs, $bodySize, $body, $url, $crawlID);
|
||||
$update->execute();
|
||||
$update->close();
|
||||
|
||||
$this->findNewUrls($crawlID, $body, $url);
|
||||
}
|
||||
|
||||
/**
|
||||
* Extrahiert Links aus einer Antwort und legt neue URL-Datensaetze an.
|
||||
*
|
||||
* @param int $crawlID Identifier der Crawl-Session.
|
||||
* @param string $body HTML-Koerper der Antwort.
|
||||
* @param string $url Bearbeitete URL, dient als Kontext fuer relative Links.
|
||||
*/
|
||||
public function findNewUrls(int $crawlID, string $body, string $url): void
|
||||
{
|
||||
if ($body === '') {
|
||||
return;
|
||||
}
|
||||
|
||||
$links = $this->extractLinks($body, $url);
|
||||
if ($links === []) {
|
||||
return;
|
||||
}
|
||||
|
||||
$originId = $this->resolveUrlId($crawlID, $url);
|
||||
if ($originId === null) {
|
||||
return;
|
||||
}
|
||||
|
||||
$deleteLinksStmt = $this->db->prepare('DELETE FROM links WHERE von = ?');
|
||||
if ($deleteLinksStmt !== false) {
|
||||
$deleteLinksStmt->bind_param('i', $originId);
|
||||
$deleteLinksStmt->execute();
|
||||
$deleteLinksStmt->close();
|
||||
}
|
||||
|
||||
$insertUrlStmt = $this->db->prepare('INSERT IGNORE INTO urls (url, crawl_id) VALUES (?, ?)');
|
||||
$selectUrlStmt = $this->db->prepare('SELECT id FROM urls WHERE url = ? AND crawl_id = ? LIMIT 1');
|
||||
$insertLinkStmt = $this->db->prepare('INSERT IGNORE INTO links (von, nach, linktext, dofollow) VALUES (?, ?, ?, ?)');
|
||||
|
||||
if (!$insertUrlStmt || !$selectUrlStmt || !$insertLinkStmt) {
|
||||
throw new RuntimeException('Vorbereitete Statements konnten nicht erstellt werden: ' . $this->db->error);
|
||||
}
|
||||
|
||||
foreach ($links as $link) {
|
||||
$absoluteUrl = (string)$link['absolute_url'];
|
||||
|
||||
$insertUrlStmt->bind_param('si', $absoluteUrl, $crawlID);
|
||||
$insertUrlStmt->execute();
|
||||
|
||||
$targetId = $this->db->insert_id;
|
||||
if ($targetId === 0) {
|
||||
$selectUrlStmt->bind_param('si', $absoluteUrl, $crawlID);
|
||||
$selectUrlStmt->execute();
|
||||
$result = $selectUrlStmt->get_result();
|
||||
$targetId = $result ? (int)($result->fetch_assoc()['id'] ?? 0) : 0;
|
||||
}
|
||||
|
||||
if ($targetId === 0) {
|
||||
continue;
|
||||
}
|
||||
|
||||
$linkText = $this->normaliseText((string)($link['text'] ?? ''));
|
||||
$isFollow = (int)(strpos((string)($link['rel'] ?? ''), 'nofollow') !== false ? 0 : 1);
|
||||
|
||||
$insertLinkStmt->bind_param('iisi', $originId, $targetId, $linkText, $isFollow);
|
||||
$insertLinkStmt->execute();
|
||||
}
|
||||
|
||||
$insertUrlStmt->close();
|
||||
$selectUrlStmt->close();
|
||||
$insertLinkStmt->close();
|
||||
}
|
||||
|
||||
/**
|
||||
* Startet einen Crawl-Durchlauf fuer unbehandelte URLs.
|
||||
*
|
||||
* @param int $crawlID Identifier der Crawl-Session.
|
||||
*/
|
||||
public function doCrawl(int $crawlID): void
|
||||
{
|
||||
$statement = $this->db->prepare(
|
||||
'SELECT url FROM urls WHERE crawl_id = ? AND date IS NULL LIMIT 50'
|
||||
);
|
||||
|
||||
if ($statement === false) {
|
||||
return;
|
||||
}
|
||||
|
||||
$statement->bind_param('i', $crawlID);
|
||||
$statement->execute();
|
||||
$result = $statement->get_result();
|
||||
|
||||
if (!$result instanceof mysqli_result) {
|
||||
$statement->close();
|
||||
return;
|
||||
}
|
||||
|
||||
$urls = [];
|
||||
while ($row = $result->fetch_assoc()) {
|
||||
$urls[] = $row['url'];
|
||||
}
|
||||
|
||||
$result->free();
|
||||
$statement->close();
|
||||
|
||||
if ($urls === []) {
|
||||
return;
|
||||
}
|
||||
|
||||
foreach ($this->getMultipleWebsites($urls) as $url => $data) {
|
||||
$this->processResults($crawlID, $url, $data);
|
||||
}
|
||||
}
|
||||
|
||||
    /**
     * Parses HTML content and returns a structured list of the links found.
     *
     * @param string $html Raw HTML document.
     * @param string $baseUrl Base URL used to resolve relative paths.
     * @return array<int,array<string,mixed>> Collected link data.
     */
    public function extractLinks(string $html, string $baseUrl = ''): array
    {
        $links = [];

        $dom = new DOMDocument();
        $previous = libxml_use_internal_errors(true);
        $dom->loadHTML('<?xml encoding="UTF-8">' . $html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        foreach ($dom->getElementsByTagName('a') as $index => $aTag) {
            $href = trim($aTag->getAttribute('href'));
            if ($href === '') {
                continue;
            }

            $absoluteUrl = $this->resolveUrl($href, $baseUrl);
            $text = $this->normaliseText(trim($aTag->textContent));
            $rel = $aTag->getAttribute('rel');
            $title = $aTag->getAttribute('title');
            $target = $aTag->getAttribute('target');

            $links[] = [
                'index' => $index + 1,
                'href' => $href,
                'absolute_url' => $absoluteUrl,
                'text' => $text,
                'rel' => $rel !== '' ? $rel : null,
                'title' => $title !== '' ? $title : null,
                'target' => $target !== '' ? $target : null,
                'is_external' => $this->isExternalLink($absoluteUrl, $baseUrl),
                'link_type' => $this->getLinkType($href),
                'is_internal' => $this->isInternalLink($absoluteUrl, $baseUrl) ? 1 : 0,
            ];
        }

        return $links;
    }

    /**
     * Checks whether a link is external relative to the base URL.
     *
     * @param string $href Link target.
     * @param string $baseUrl Origin address used for the domain comparison.
     * @return bool|null True if external, false if internal, null if undetermined.
     */
    private function isExternalLink(string $href, string $baseUrl): ?bool
    {
        if ($baseUrl === '') {
            return null;
        }

        $baseDomain = parse_url($baseUrl, PHP_URL_HOST);
        $linkDomain = parse_url($href, PHP_URL_HOST);

        // parse_url() yields null for a missing host and false for a malformed URL.
        if (!is_string($baseDomain) || !is_string($linkDomain)) {
            return null;
        }

        // Host names are case-insensitive, so a case-insensitive compare is required here.
        return strcasecmp($baseDomain, $linkDomain) !== 0;
    }

    /**
     * Checks whether a link points to the same domain as the base URL.
     *
     * @param string $href Link target.
     * @param string $baseUrl Origin address used for the domain comparison.
     * @return bool|null True if internal, false if external, null if undetermined.
     */
    private function isInternalLink(string $href, string $baseUrl): ?bool
    {
        if ($baseUrl === '') {
            return null;
        }

        $baseDomain = parse_url($baseUrl, PHP_URL_HOST);
        $linkDomain = parse_url($href, PHP_URL_HOST);

        // parse_url() yields null for a missing host and false for a malformed URL.
        if (!is_string($baseDomain) || !is_string($linkDomain)) {
            return null;
        }

        // Host names are case-insensitive, so a case-insensitive compare is required here.
        return strcasecmp($baseDomain, $linkDomain) === 0;
    }

    /**
     * Derives the link type from common protocols and patterns.
     *
     * @param string $href Link target.
     * @return string Descriptive type such as "absolute" or "email".
     */
    private function getLinkType(string $href): string
    {
        if ($href === '') {
            return 'empty';
        }

        $lower = strtolower($href);
        if (strpos($lower, 'mailto:') === 0) {
            return 'email';
        }
        if (strpos($lower, 'tel:') === 0) {
            return 'phone';
        }
        if (strpos($lower, '#') === 0) {
            return 'anchor';
        }
        if (strpos($lower, 'javascript:') === 0) {
            return 'javascript';
        }
        if (filter_var($href, FILTER_VALIDATE_URL)) {
            return 'absolute';
        }

        return 'relative';
    }

    /**
     * Groups links by their previously determined type.
     *
     * @param array<int,array<string,mixed>> $links List of extracted links.
     * @return array<string,array<int,array<string,mixed>>> Links grouped by type.
     */
    public function groupLinksByType(array $links): array
    {
        $grouped = [];

        foreach ($links as $link) {
            $type = (string)($link['link_type'] ?? 'unknown');
            $grouped[$type][] = $link;
        }

        return $grouped;
    }

    /**
     * Creates a configured curl handle for a request.
     *
     * @param string $url Target URL of the request.
     * @return CurlHandle|resource
     * @throws RuntimeException If the handle cannot be initialised.
     */
    private function createCurlHandle(string $url)
    {
        $handle = curl_init($url);
        if ($handle === false) {
            throw new RuntimeException('Could not initialise curl handle: ' . $url);
        }

        curl_setopt_array($handle, [
            CURLOPT_URL => $url,
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_HEADER => true,
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_TIMEOUT => self::CURL_TIMEOUT,
            CURLOPT_USERAGENT => self::USER_AGENT,
            // Disabled so hosts with broken certificates can still be crawled;
            // do not reuse this handle configuration for sensitive requests.
            CURLOPT_SSL_VERIFYPEER => false,
        ]);

        return $handle;
    }

    /**
     * Splits headers from the body and assembles the response array.
     *
     * @param string $response Full response including headers.
     * @param array<string,mixed> $info Result of curl_getinfo().
     * @return array<string,mixed>
     */
    private function buildResponsePayload(string $response, array $info): array
    {
        $headerSize = (int)($info['header_size'] ?? 0);
        $headers = substr($response, 0, $headerSize);
        $body = substr($response, $headerSize);

        return [
            'url' => $info['url'] ?? ($info['redirect_url'] ?? ''),
            'status_code' => (int)($info['http_code'] ?? 0),
            'headers_parsed' => $this->parseHeaders($headers),
            'body' => $body,
            'response_time' => (float)($info['total_time'] ?? 0.0),
            'body_size' => strlen($body),
        ];
    }

    /**
     * Converts a raw header string into an associative array.
     *
     * @param string $headers Raw headers.
     * @return array<string,string>
     */
    private function parseHeaders(string $headers): array
    {
        $parsed = [];
        foreach (preg_split('/\r?\n/', trim($headers)) as $line) {
            if ($line === '' || strpos($line, ':') === false) {
                continue;
            }

            [$key, $value] = explode(':', $line, 2);
            $parsed[trim($key)] = trim($value);
        }

        return $parsed;
    }

    /**
     * Resolves a relative path against a base URL into an absolute address.
     */
    private function resolveUrl(string $href, string $baseUrl): string
    {
        if ($href === '' || filter_var($href, FILTER_VALIDATE_URL)) {
            return $href;
        }

        if ($baseUrl === '') {
            return $href;
        }

        $baseParts = parse_url($baseUrl);
        if ($baseParts === false || !isset($baseParts['scheme'], $baseParts['host'])) {
            return $href;
        }

        $scheme = $baseParts['scheme'];
        $host = $baseParts['host'];
        $port = isset($baseParts['port']) ? ':' . $baseParts['port'] : '';
        $basePath = $baseParts['path'] ?? '/';

        if (strpos($href, '/') === 0) {
            $path = $href;
        } else {
            if (substr($basePath, -1) !== '/') {
                // Strip the trailing file segment so relative links resolve against the directory.
                $basePath = preg_replace('#/[^/]*$#', '/', $basePath) ?: '/';
            }
            $path = $basePath . $href;
        }

        return sprintf('%s://%s%s%s', $scheme, $host, $port, '/' . ltrim($path, '/'));
    }

    /**
     * Normalises text to clean UTF-8 without control characters.
     */
    private function normaliseText(string $text): string
    {
        $normalized = preg_replace('/\s+/u', ' ', $text) ?? '';
        $encoding = mb_detect_encoding($normalized, ['UTF-8', 'ISO-8859-1', 'Windows-1252'], true) ?: 'UTF-8';

        return trim(mb_convert_encoding($normalized, 'UTF-8', $encoding));
    }

    /**
     * Looks up the ID of a URL within a crawl run.
     */
    private function resolveUrlId(int $crawlID, string $url): ?int
    {
        $statement = $this->db->prepare('SELECT id FROM urls WHERE url = ? AND crawl_id = ? LIMIT 1');
        if ($statement === false) {
            return null;
        }

        $statement->bind_param('si', $url, $crawlID);
        $statement->execute();
        $result = $statement->get_result();
        // Guard against a missing result set or an empty row before indexing.
        $row = $result instanceof mysqli_result ? $result->fetch_assoc() : null;
        $statement->close();

        return isset($row['id']) ? (int)$row['id'] : null;
    }
}
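As a standalone illustration of the parsing approach `extractLinks` relies on (suppress libxml warnings, load the markup into `DOMDocument`, then walk every `<a>` tag), the following self-contained sketch runs without the class or its database. The HTML snippet and the reduced result shape are hypothetical, chosen only for demonstration.

```php
<?php
// Minimal sketch of the DOMDocument-based link extraction used above.
// Hypothetical input; not part of the crawler itself.
$html = '<ul><li><a href="/about" rel="nofollow">About</a></li>'
      . '<li><a href="https://example.org/">Example</a></li></ul>';

$dom = new DOMDocument();
$previous = libxml_use_internal_errors(true);
// The XML prologue forces UTF-8; the flags avoid implied <html>/<body> and a default doctype.
$dom->loadHTML('<?xml encoding="UTF-8">' . $html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$links = [];
foreach ($dom->getElementsByTagName('a') as $aTag) {
    $links[] = [
        'href'   => trim($aTag->getAttribute('href')),
        'text'   => trim($aTag->textContent),
        // Mirrors the nofollow check used when persisting links.
        'follow' => strpos($aTag->getAttribute('rel'), 'nofollow') === false,
    ];
}

var_export($links);
```

The same suppress-load-restore pattern around `libxml_use_internal_errors` keeps malformed real-world HTML from flooding the error buffer while still letting the parse proceed.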