Compare commits

...

15 Commits

Author SHA1 Message Date
1588f83624 Add pagination to all data tables using jQuery DataTables
Libraries Added:
- jQuery 3.7.1 from CDN
- DataTables 1.13.7 (CSS + JS) from CDN

Custom Styling:
- Integrated DataTables styling with existing design
- Custom pagination button styles
- Responsive search and filter inputs

Paginated Tables:
- jobsTable: Crawl jobs (25/page, sorted by ID desc)
- pagesTable: Crawled pages (50/page)
- linksTable: Found links (50/page)
- brokenTable: Broken links (25/page)
- redirectsTable: Redirects (25/page)
- seoTable: SEO issues (25/page)

Features:
- Search functionality per table
- Column sorting
- Configurable entries per page
- German localization
- Automatic reinitialization on data reload
- Navigation controls (First/Previous/Next/Last)
- Entry count display

All quality checks pass:
- PHPStan Level 8: 0 errors

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-04 09:49:39 +02:00
c40d44e4c9 Add redirect tracking and analysis features
Database Schema:
- Added redirect_url VARCHAR(2048) to pages table
- Added redirect_count INT DEFAULT 0 to pages table
- Added index on redirect_count for faster queries

Configuration:
- Created Config class with typed constants (PHP 8.3+)
- MAX_REDIRECT_THRESHOLD = 3 (configurable warning threshold)
- MAX_CRAWL_DEPTH = 50
- CONCURRENCY = 10
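
For illustration, a minimal sketch of how these typed constants (a PHP 8.3+ feature) might be consumed; the `$redirectCount` value is a stand-in:

```php
<?php

require 'vendor/autoload.php';

use App\Config;

// Hypothetical consumer of the typed constants defined in src/classes/Config.php
$redirectCount = 5; // stand-in for a value read from the pages table

if ($redirectCount > Config::MAX_REDIRECT_THRESHOLD) {
    echo "Warning: {$redirectCount} redirects exceed the threshold of "
        . Config::MAX_REDIRECT_THRESHOLD . "\n";
}
```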

Backend Changes:
- Crawler now tracks redirects using Guzzle's redirect tracking
- Extracts redirect history from response headers
- Records redirect count and final destination URL
- Guzzle configured with max 10 redirects and tracking enabled
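
A condensed sketch of this redirect tracking (the client options and the `X-Guzzle-Redirect-History` header match the Crawler.php diff further down; the URL is a placeholder):

```php
<?php

require 'vendor/autoload.php'; // guzzlehttp/guzzle

use GuzzleHttp\Client;

$client = new Client([
    'allow_redirects' => [
        'max' => 10,               // follow at most 10 redirects
        'track_redirects' => true, // record each hop in a response header
    ],
]);

$response = $client->get('https://example.com/old-path'); // placeholder URL

// Guzzle exposes the redirect chain via the X-Guzzle-Redirect-History header.
$history = $response->getHeader('X-Guzzle-Redirect-History');
$redirectCount = count($history);
$redirectUrl = $redirectCount > 0 ? end($history) : null;

echo "Redirects: {$redirectCount}, final URL: " . ($redirectUrl ?? 'none') . "\n";
```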

API Endpoint:
- New endpoint: /api.php?action=redirects
- Analyzes redirect types (permanent 301/308 vs temporary 302/303/307)
- Identifies excessive redirects (> threshold)
- Returns statistics and detailed redirect information

Frontend Changes:
- Added "Redirects" tab with:
  * Statistics overview (Total, Permanent, Temporary, Excessive)
  * Detailed table showing all redirects
  * Visual warnings for excessive redirects (yellow background)
  * Color-coded redirect counts (red when > threshold)
  * Status code badges (green for permanent, blue for temporary)

All quality checks pass:
- PHPStan Level 8: 0 errors
- PHPCS PSR-12: 0 errors

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-04 09:40:26 +02:00
e6b75410ed Add copyright information to README
Added visible copyright section with author information:
- Martin Kiesewetter
- mki@kies-media.de
- https://kies-media.de

Also updated project title from "PHP Docker Anwendung" to "Web Crawler"

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-04 09:29:05 +02:00
f7be09ec63 Add Broken Links detection and SEO Analysis features
Database Schema:
- Added meta_description TEXT field to pages table
- Added index on status_code for faster broken link queries

Backend Changes:
- Crawler now extracts meta descriptions from pages
- New API endpoint: broken-links (finds 404s and server errors)
- New API endpoint: seo-analysis (analyzes titles and meta descriptions)

SEO Analysis Features:
- Title length validation (optimal: 30-60 chars)
- Meta description length validation (optimal: 70-160 chars)
- Detection of missing titles/descriptions
- Duplicate content detection (titles and meta descriptions)
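
A sketch of these length checks, mirroring the thresholds used by the seo-analysis endpoint (the helper function name is illustrative):

```php
<?php

// Illustrative helper mirroring the endpoint's title/meta length validation
function checkSeoLengths(?string $title, ?string $description): array
{
    $issues = [];
    $titleLen = mb_strlen($title ?? '');
    $descLen = mb_strlen($description ?? '');

    if ($titleLen === 0) {
        $issues[] = 'Title missing';
    } elseif ($titleLen < 30) {
        $issues[] = "Title too short ({$titleLen} chars)";
    } elseif ($titleLen > 60) {
        $issues[] = "Title too long ({$titleLen} chars)";
    }

    if ($descLen === 0) {
        $issues[] = 'Meta description missing';
    } elseif ($descLen < 70) {
        $issues[] = "Meta description too short ({$descLen} chars)";
    } elseif ($descLen > 160) {
        $issues[] = "Meta description too long ({$descLen} chars)";
    }

    return $issues;
}

print_r(checkSeoLengths('Web Crawler', null));
// Title too short (11 chars), Meta description missing
```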

Frontend Changes:
- Added "Broken Links" tab showing pages with errors
- Added "SEO Analysis" tab with:
  * Statistics overview
  * Pages with SEO issues
  * Duplicate content report

All quality checks pass:
- PHPStan Level 8: 0 errors
- PHPCS PSR-12: 0 warnings

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-04 09:26:33 +02:00
9e61572747 Add recrawl functionality and fix PHPCS warnings
- Added "Recrawl" button in jobs table UI
- Implemented recrawl API endpoint that deletes all job data and restarts crawl
- Fixed PHPCS line length warnings in api.php and Crawler.php
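
A condensed sketch of the recrawl flow (the PDO handle and worker path are assumptions; the real endpoint lives in src/api.php, see the diff below):

```php
<?php

// Assumes $db is the shared PDO instance (App\Database::getInstance())
function recrawl(\PDO $db, int $jobId): void
{
    // Remove everything the previous run produced for this job
    foreach (['crawl_queue', 'links', 'pages'] as $table) {
        $db->prepare("DELETE FROM {$table} WHERE crawl_job_id = ?")->execute([$jobId]);
    }

    // Reset the job so it looks freshly created
    $db->prepare(
        "UPDATE crawl_jobs SET status = 'pending', total_pages = 0, total_links = 0, " .
        "started_at = NULL, completed_at = NULL WHERE id = ?"
    )->execute([$jobId]);

    // Restart the crawl in a detached background process
    exec('php ' . __DIR__ . "/crawler-worker.php {$jobId} > /dev/null 2>&1 &");
}
```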

All quality checks pass:
- PHPStan Level 8: 0 errors
- PHPCS PSR-12: 0 warnings

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-04 09:07:50 +02:00
11fd8fa673 Add copyright headers to configuration files
Extended copyright headers to SQL, YAML, and JSON configuration files:
- config/docker/init.sql (SQL comment block)
- docker-compose.yml (YAML comment)
- composer.json and src/composer.json (JSON _comment field)

All files validated and tested successfully.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-04 08:58:28 +02:00
cbf099701b Add copyright headers to all application files
Added copyright headers to all PHP files in the application with proper author information (Martin Kiesewetter) and contact details.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-04 08:47:44 +02:00
ad274c0738 Update paths for config/ directory structure
Adjusted all references to match new config/ structure:
- docker/config/nginx/default.conf → config/nginx/default.conf
- docker/init.sql → config/docker/init.sql
- docker/start.sh → config/docker/start.sh

Updated files:
- docker-compose.yml: Updated volume mount paths
- README.md: Updated project structure documentation

New structure consolidates all configuration files under config/
for better organization and clarity.

Tested and verified all services running correctly.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-04 08:36:58 +02:00
de4d2e53d9 Reorganize Docker-related files into docker/ directory
Moved Docker infrastructure files to dedicated docker/ folder:
- config/nginx/default.conf → docker/config/nginx/default.conf
- init.sql → docker/init.sql
- start.sh → docker/start.sh (currently unused)

Updated:
- docker-compose.yml: Adjusted volume paths
- README.md: Updated project structure documentation

Benefits:
- Clear separation between infrastructure (docker/) and application (src/)
- Better project organization
- Easier to understand for new developers

Docker Compose and Dockerfile remain in root for convenience.
All services tested and working correctly.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-04 08:31:47 +02:00
daa76b2141 Remove legacy PHP files from root directory
Removed unused legacy files:
- index.php (old crawler entry point)
- webanalyse.php (old crawler implementation)
- setnew.php (database reset script)

These files are no longer used. The current application uses:
- src/index.php (web interface)
- src/api.php (API endpoints)
- src/classes/Crawler.php (crawler implementation)
- src/crawler-worker.php (background worker)

The legacy code remains in git history if needed.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-04 08:24:43 +02:00
09d5b61779 Fix link extraction bug caused by type checking
The PHPStan fix inadvertently broke link extraction by using is_int()
on $pageId, which failed when lastInsertId() or fetchColumn() returned
a string instead of an int.

Changes:
- Convert $pageId to int explicitly after fetching
- Use $pageId > 0 instead of is_int($pageId) for validation
- Handle both 0 and '0' cases when fetching manually

This ensures link extraction works again while maintaining type safety.
Tests pass, PHPStan clean.
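
A sketch of the corrected handling (the `$crawlJobId` and `$url` values stand in for the crawler's state; compare the Crawler.php diff below):

```php
<?php

require 'vendor/autoload.php';

$db = \App\Database::getInstance();
$crawlJobId = 1;                   // stand-in job id
$url = 'https://example.com/';     // stand-in page URL

// PDO::lastInsertId() returns a string, so compare loosely and cast explicitly.
$pageId = $db->lastInsertId();     // e.g. "42", or "0" when nothing new was inserted

if ($pageId == 0) {                // covers both 0 and '0'
    $stmt = $db->prepare("SELECT id FROM pages WHERE crawl_job_id = ? AND url = ?");
    $stmt->execute([$crawlJobId, $url]);
    $fetched = $stmt->fetchColumn(); // int|string|false depending on driver
    $pageId = is_numeric($fetched) ? (int)$fetched : 0;
}

$pageId = is_numeric($pageId) ? (int)$pageId : 0;

if ($pageId > 0) {                 // instead of is_int($pageId)
    // safe to extract and persist links for this page
}
```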

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-04 08:18:52 +02:00
e569d189d5 Add comprehensive quality tooling and fix code style issues
Quality Tools Added:
- PHPStan (Level 8) for static analysis
- PHP_CodeSniffer (PSR-12) for code style
- Updated PHPUnit test suite with type safety

Code Improvements:
- Fixed all PHPStan Level 8 errors (13 issues)
- Auto-fixed 25 PSR-12 code style violations
- Added proper type hints for arrays and method parameters
- Fixed PDOStatement|false handling in api.php and tests
- Improved null-safety for parse_url() calls

Configuration:
- phpstan.neon: Level 8, analyzes src/ and tests/
- phpcs.xml: PSR-12 standard, excludes vendor/
- docker-compose.yml: Mount config files for tooling
- composer.json: Add phpstan, phpcs, phpcbf scripts

Documentation:
- Updated README.md with testing and quality sections
- Updated AGENTS.md with quality gates and workflows
- Added pre-commit checklist for developers

All tests pass (9/9), PHPStan clean (0 errors), PHPCS compliant (1 warning)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-03 23:58:21 +02:00
b5640ad131 docker-compose 2025-10-03 23:26:28 +02:00
5b5a627662 gitignore 2025-10-03 23:19:49 +02:00
4e868ca8e9 Miscellaneous 2025-10-03 20:22:17 +02:00
23 changed files with 1123 additions and 529 deletions

.gitignore

@@ -22,4 +22,5 @@ Thumbs.db
*.cache
# Docker
docker-compose.override.yml
/.claude/settings.local.json

AGENTS.md

@@ -4,13 +4,109 @@
The codebase is intentionally lean. `index.php` bootstraps the crawl by instantiating `webanalyse` and handing off the crawl identifier. Core crawling logic lives in `webanalyse.php`, which houses HTTP fetching, link extraction, and database persistence. Use `setnew.php` to reset seed data inside the `screaming_frog` schema before a rerun. Keep new helpers in their own PHP files under this root so the autoload includes stay predictable; group SQL migrations or fixtures under a `database/` folder if you add them. IDE settings reside in `.idea/`.
## Build, Test, and Development Commands
Run the project through Apache in XAMPP or start the PHP built-in server with `php -S localhost:8080 index.php` from this directory. Validate syntax quickly via `php -l webanalyse.php` (repeat for any new file). When iterating on crawl logic, truncate runtime tables with `php setnew.php` to restore the baseline dataset.
### Docker Development
The project runs in Docker containers. Use these commands:
```bash
# Start containers
docker-compose up -d
# Stop containers
docker-compose down
# Rebuild containers
docker-compose up -d --build
# View logs
docker-compose logs -f php
```
### Running Tests
The project uses PHPUnit for automated testing:
```bash
# Run all tests (Unit + Integration)
docker-compose exec php sh -c "php /var/www/html/vendor/bin/phpunit /var/www/tests/"
# Or use the composer shortcut
docker-compose exec php composer test
```
**Test Structure:**
- `tests/Unit/` - Unit tests for individual components
- `tests/Integration/` - Integration tests for full crawl workflows
- All tests run in isolated database transactions
### Static Code Analysis
PHPStan is configured at Level 8 (strictest) to ensure type safety:
```bash
# Run PHPStan analysis
docker-compose exec php sh -c "php -d memory_limit=512M /var/www/html/vendor/bin/phpstan analyse -c /var/www/phpstan.neon"
# Or use the composer shortcut
docker-compose exec php composer phpstan
```
**PHPStan Configuration:**
- Level: 8 (maximum strictness)
- Analyzes: `src/` and `tests/`
- Excludes: `vendor/`
- Config file: `phpstan.neon`
All code must pass PHPStan Level 8 with zero errors before merging.
### Code Style Checking
PHP_CodeSniffer enforces PSR-12 coding standards:
```bash
# Check code style
docker-compose exec php composer phpcs
# Automatically fix code style issues
docker-compose exec php composer phpcbf
```
**PHPCS Configuration:**
- Standard: PSR-12
- Analyzes: `src/` and `tests/`
- Excludes: `vendor/`
- Auto-fix available via `phpcbf`
Run `phpcbf` before committing to automatically fix most style violations.
## Coding Style & Naming Conventions
Follow PSR-12 style cues already in use: 4-space indentation, brace-on-new-line for functions, and `declare(strict_types=1);` at the top of entry scripts. Favour descriptive camelCase for methods (`getMultipleWebsites`) and snake_case only for direct SQL field names. Maintain `mysqli` usage for consistency, and gate new configuration through constants or clearly named environment variables.
## Testing Guidelines
There is no automated suite yet; treat each crawl as an integration test. After code changes, run `php setnew.php` followed by a crawl and confirm that `crawl`, `urls`, and `links` tables reflect the expected row counts. Log anomalies with `error_log()` while developing, and remove or downgrade to structured responses before merging.
### Automated Testing
The project has a comprehensive test suite using PHPUnit:
- **Write tests first**: Follow TDD principles when adding new features
- **Unit tests** (`tests/Unit/`): Test individual classes and methods in isolation
- **Integration tests** (`tests/Integration/`): Test full crawl workflows with real HTTP requests
- **Database isolation**: Tests use transactions that roll back automatically (see the sketch after this list)
- **Coverage**: Aim for high test coverage on critical crawl logic
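An illustrative sketch (not the project's actual test code) of that transaction-based isolation pattern:
```php
<?php

use App\Database;
use PHPUnit\Framework\TestCase;

// Hypothetical test showing the rollback pattern described above
class ExampleIsolationTest extends TestCase
{
    private \PDO $db;

    protected function setUp(): void
    {
        $this->db = Database::getInstance();
        $this->db->beginTransaction(); // every test runs inside a transaction
    }

    protected function tearDown(): void
    {
        $this->db->rollBack(); // discard all writes made by the test
    }

    public function testInsertIsIsolated(): void
    {
        $stmt = $this->db->prepare(
            "INSERT INTO crawl_jobs (domain, status) VALUES (?, 'pending')"
        );
        $stmt->execute(['https://example.com']);
        $this->assertGreaterThan(0, (int)$this->db->lastInsertId());
    }
}
```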
### Quality Gates
Before committing code, ensure:
1. All tests pass: `docker-compose exec php composer test`
2. PHPStan analysis passes: `docker-compose exec php composer phpstan`
3. Code style is correct: `docker-compose exec php composer phpcs`
4. Auto-fix style issues: `docker-compose exec php composer phpcbf`
**Pre-commit Checklist:**
- ✅ Tests pass
- ✅ PHPStan Level 8 with 0 errors
- ✅ PHPCS PSR-12 compliance (warnings acceptable)
### Manual Testing
For UI changes, manually test the crawler interface at http://localhost:8080. Verify:
- Job creation and status updates
- Page and link extraction accuracy
- Error handling for invalid URLs or network issues
## Commit & Pull Request Guidelines
Author commit messages in the present tense with a concise summary (`Add link grouping for external URLs`). Group related SQL adjustments with their PHP changes in the same commit. For pull requests, include: a short context paragraph, reproduction steps, screenshots of key output tables when behaviour changes, and any follow-up tasks. Link tracking tickets or issues so downstream agents can trace decisions.

README.md

@@ -1,7 +1,17 @@
# PHP Docker Anwendung
# Web Crawler
Eine PHP-Anwendung mit MariaDB, die in Docker läuft.
## Copyright & Lizenz
**Copyright © 2025 Martin Kiesewetter**
- **Autor:** Martin Kiesewetter
- **E-Mail:** mki@kies-media.de
- **Website:** [https://kies-media.de](https://kies-media.de)
---
## Anforderungen
- Docker
@@ -43,16 +53,79 @@ docker-compose up -d --build
```
.
├── docker-compose.yml # Docker Compose Konfiguration
├── Dockerfile # PHP Container Image
├── start.sh # Container Start-Script
├── init.sql # Datenbank Initialisierung
├── config/
├── Dockerfile # PHP Container Image
├── config/ # Konfigurationsdateien
│ ├── docker/
│ │ ├── init.sql # Datenbank Initialisierung
│ │ └── start.sh # Container Start-Script (unused)
│ └── nginx/
│ └── default.conf # Nginx Konfiguration
│ └── default.conf # Nginx Konfiguration
├── src/
│ └── index.php # Hauptanwendung
├── src/ # Anwendungscode
│ ├── api.php
│ ├── index.php
│ ├── classes/
│ └── crawler-worker.php
├── tests/ # Test Suite
│ ├── Unit/
│ └── Integration/
├── phpstan.neon # PHPStan Konfiguration
└── phpcs.xml # PHPCS Konfiguration
```
## Entwicklung
Die Anwendungsdateien befinden sich im `src/` Verzeichnis und werden als Volume in den Container gemountet, sodass Änderungen sofort sichtbar sind.
## Tests & Code-Qualität
### Unit Tests ausführen
Die Anwendung verwendet PHPUnit für Unit- und Integrationstests:
```bash
# Alle Tests ausführen
docker-compose exec php sh -c "php /var/www/html/vendor/bin/phpunit /var/www/tests/"
# Alternative mit Composer-Script
docker-compose exec php composer test
```
Die Tests befinden sich in:
- `tests/Unit/` - Unit Tests
- `tests/Integration/` - Integration Tests
### Statische Code-Analyse mit PHPStan
PHPStan ist auf Level 8 (höchstes Level) konfiguriert und analysiert den gesamten Code:
```bash
# PHPStan ausführen
docker-compose exec php sh -c "php -d memory_limit=512M /var/www/html/vendor/bin/phpstan analyse -c /var/www/phpstan.neon"
# Alternative mit Composer-Script
docker-compose exec php composer phpstan
```
**PHPStan Konfiguration:**
- Level: 8 (strictest)
- Analysierte Pfade: `src/` und `tests/`
- Ausgeschlossen: `vendor/` Ordner
- Konfigurationsdatei: `phpstan.neon`
### Code Style Prüfung mit PHP_CodeSniffer
PHP_CodeSniffer (PHPCS) prüft den Code gegen PSR-12 Standards:
```bash
# Code Style prüfen
docker-compose exec php composer phpcs
# Code Style automatisch korrigieren
docker-compose exec php composer phpcbf
```
**PHPCS Konfiguration:**
- Standard: PSR-12
- Analysierte Pfade: `src/` und `tests/`
- Ausgeschlossen: `vendor/` Ordner
- Auto-Fix verfügbar mit `phpcbf`

composer.json

@@ -1,4 +1,5 @@
{
"_comment": "Web Crawler - Composer Configuration | Copyright (c) 2025 Martin Kiesewetter <mki@kies-media.de> | https://kies-media.de",
"name": "web-crawler/app",
"description": "Web Crawler Application with Parallel Processing",
"type": "project",

config/docker/init.sql

@@ -1,3 +1,11 @@
/**
* Web Crawler - Database Schema
*
* @copyright Copyright (c) 2025 Martin Kiesewetter
* @author Martin Kiesewetter <mki@kies-media.de>
* @link https://kies-media.de
*/
-- Database initialization script for Web Crawler
-- Crawl Jobs Table
@@ -20,12 +28,17 @@ CREATE TABLE IF NOT EXISTS pages (
crawl_job_id INT NOT NULL,
url VARCHAR(2048) NOT NULL,
title VARCHAR(500),
meta_description TEXT,
status_code INT,
content_type VARCHAR(100),
redirect_url VARCHAR(2048),
redirect_count INT DEFAULT 0,
crawled_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (crawl_job_id) REFERENCES crawl_jobs(id) ON DELETE CASCADE,
INDEX idx_crawl_job (crawl_job_id),
INDEX idx_url (url(255)),
INDEX idx_status_code (status_code),
INDEX idx_redirect_count (redirect_count),
UNIQUE KEY unique_job_url (crawl_job_id, url(255))
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

docker-compose.yml

@@ -1,3 +1,9 @@
# Web Crawler - Docker Compose Configuration
#
# @copyright Copyright (c) 2025 Martin Kiesewetter
# @author Martin Kiesewetter <mki@kies-media.de>
# @link https://kies-media.de
version: '3.8'
services:
@@ -10,6 +16,11 @@ services:
- "8080:80"
volumes:
- ./src:/var/www/html
- ./tests:/var/www/tests
- ./composer.json:/var/www/composer.json
- ./composer.lock:/var/www/composer.lock
- ./phpstan.neon:/var/www/phpstan.neon
- ./phpcs.xml:/var/www/phpcs.xml
- ./config/nginx/default.conf:/etc/nginx/conf.d/default.conf
depends_on:
- mariadb
@@ -29,7 +40,7 @@ services:
- "3307:3306"
volumes:
- mariadb_data:/var/lib/mysql
- ./init.sql:/docker-entrypoint-initdb.d/init.sql
- ./config/docker/init.sql:/docker-entrypoint-initdb.d/init.sql
networks:
- app-network

index.php (deleted)

@@ -1,13 +0,0 @@
<?php
declare(strict_types=1);
Error_reporting(E_ALL);
ini_set('display_errors', 1);
require_once 'webanalyse.php';
$wa = new webanalyse();
$db = mysqli_connect("localhost", "root", "", "screaming_frog");
$wa-> doCrawl(1);

phpcs.xml (new file)

@@ -0,0 +1,19 @@
<?xml version="1.0"?>
<ruleset name="ScreamingFrog">
<description>PHP_CodeSniffer configuration</description>
<!-- Use PSR-12 coding standard -->
<rule ref="PSR12"/>
<!-- Paths to check -->
<file>/var/www/html</file>
<file>/var/www/tests</file>
<!-- Exclude vendor directory -->
<exclude-pattern>/var/www/html/vendor/*</exclude-pattern>
<exclude-pattern>*/vendor/*</exclude-pattern>
<!-- Show progress and colors -->
<arg name="colors"/>
<arg value="sp"/>
</ruleset>

phpstan.neon (new file)

@@ -0,0 +1,7 @@
parameters:
    level: 8
    paths:
        - /var/www/html
        - /var/www/tests
    excludePaths:
        - /var/www/html/vendor

setnew.php (deleted)

@@ -1,11 +0,0 @@
<?php
$db = mysqli_connect("localhost", "root", "", "screaming_frog");
$db->query("truncate table crawl");
// $db->query("insert into crawl (start_url, user_id) values ('https://kies-media.de/', 1)");
$db->query("insert into crawl (start_url, user_id) values ('https://kies-media.de/leistungen/externer-ausbilder-fuer-fachinformatiker/', 1)");
$db->query("truncate table urls");
$urls = $db->query("insert ignore into urls (id, url, crawl_id) select 1,start_url, id from crawl where id = 1"); #->fetch_all(MYSQLI_ASSOC)
$db->query("truncate table links");

src/api.php

@@ -1,5 +1,13 @@
<?php
/**
* Web Crawler - API Endpoint
*
* @copyright Copyright (c) 2025 Martin Kiesewetter
* @author Martin Kiesewetter <mki@kies-media.de>
* @link https://kies-media.de
*/
require_once __DIR__ . '/vendor/autoload.php';
use App\Database;
@@ -73,6 +81,9 @@ try {
case 'jobs':
$stmt = $db->query("SELECT * FROM crawl_jobs ORDER BY created_at DESC LIMIT 50");
if ($stmt === false) {
throw new Exception('Failed to query jobs');
}
$jobs = $stmt->fetchAll();
echo json_encode([
@@ -105,6 +116,148 @@ try {
]);
break;
case 'broken-links':
$jobId = $_GET['job_id'] ?? 0;
$stmt = $db->prepare(
"SELECT * FROM pages " .
"WHERE crawl_job_id = ? AND (status_code >= 400 OR status_code = 0) " .
"ORDER BY status_code DESC, url"
);
$stmt->execute([$jobId]);
$brokenLinks = $stmt->fetchAll();
echo json_encode([
'success' => true,
'broken_links' => $brokenLinks
]);
break;
case 'seo-analysis':
$jobId = $_GET['job_id'] ?? 0;
$stmt = $db->prepare(
"SELECT id, url, title, meta_description, status_code FROM pages " .
"WHERE crawl_job_id = ? ORDER BY url"
);
$stmt->execute([$jobId]);
$pages = $stmt->fetchAll();
$issues = [];
foreach ($pages as $page) {
$pageIssues = [];
$titleLen = mb_strlen($page['title'] ?? '');
$descLen = mb_strlen($page['meta_description'] ?? '');
// Title issues (Google: 50-60 chars optimal)
if (empty($page['title'])) {
$pageIssues[] = 'Title missing';
} elseif ($titleLen < 30) {
$pageIssues[] = "Title too short ({$titleLen} chars)";
} elseif ($titleLen > 60) {
$pageIssues[] = "Title too long ({$titleLen} chars)";
}
// Meta description issues (Google: 120-160 chars optimal)
if (empty($page['meta_description'])) {
$pageIssues[] = 'Meta description missing';
} elseif ($descLen < 70) {
$pageIssues[] = "Meta description too short ({$descLen} chars)";
} elseif ($descLen > 160) {
$pageIssues[] = "Meta description too long ({$descLen} chars)";
}
if (!empty($pageIssues)) {
$issues[] = [
'url' => $page['url'],
'title' => $page['title'],
'title_length' => $titleLen,
'meta_description' => $page['meta_description'],
'meta_length' => $descLen,
'issues' => $pageIssues
];
}
}
// Find duplicates
$titleCounts = [];
$descCounts = [];
foreach ($pages as $page) {
if (!empty($page['title'])) {
$titleCounts[$page['title']][] = $page['url'];
}
if (!empty($page['meta_description'])) {
$descCounts[$page['meta_description']][] = $page['url'];
}
}
$duplicates = [];
foreach ($titleCounts as $title => $urls) {
if (count($urls) > 1) {
$duplicates[] = [
'type' => 'title',
'content' => $title,
'urls' => $urls
];
}
}
foreach ($descCounts as $desc => $urls) {
if (count($urls) > 1) {
$duplicates[] = [
'type' => 'meta_description',
'content' => $desc,
'urls' => $urls
];
}
}
echo json_encode([
'success' => true,
'issues' => $issues,
'duplicates' => $duplicates,
'total_pages' => count($pages)
]);
break;
case 'redirects':
$jobId = $_GET['job_id'] ?? 0;
$stmt = $db->prepare(
"SELECT url, title, status_code, redirect_url, redirect_count FROM pages " .
"WHERE crawl_job_id = ? AND redirect_count > 0 " .
"ORDER BY redirect_count DESC, url"
);
$stmt->execute([$jobId]);
$redirects = $stmt->fetchAll();
// Count redirect types
$permanent = 0;
$temporary = 0;
$excessive = 0;
$maxThreshold = 3; // From Config::MAX_REDIRECT_THRESHOLD
foreach ($redirects as $redirect) {
$code = $redirect['status_code'];
if ($code == 301 || $code == 308) {
$permanent++;
} elseif ($code == 302 || $code == 303 || $code == 307) {
$temporary++;
}
if ($redirect['redirect_count'] > $maxThreshold) {
$excessive++;
}
}
echo json_encode([
'success' => true,
'redirects' => $redirects,
'stats' => [
'total' => count($redirects),
'permanent' => $permanent,
'temporary' => $temporary,
'excessive' => $excessive,
'threshold' => $maxThreshold
]
]);
break;
case 'delete':
$jobId = $_POST['job_id'] ?? 0;
$stmt = $db->prepare("DELETE FROM crawl_jobs WHERE id = ?");
@@ -116,6 +269,42 @@ try {
]);
break;
case 'recrawl':
$jobId = $_POST['job_id'] ?? 0;
$domain = $_POST['domain'] ?? '';
if (empty($domain)) {
throw new Exception('Domain is required');
}
// Delete all related data for this job
$stmt = $db->prepare("DELETE FROM crawl_queue WHERE crawl_job_id = ?");
$stmt->execute([$jobId]);
$stmt = $db->prepare("DELETE FROM links WHERE crawl_job_id = ?");
$stmt->execute([$jobId]);
$stmt = $db->prepare("DELETE FROM pages WHERE crawl_job_id = ?");
$stmt->execute([$jobId]);
// Reset job status
$stmt = $db->prepare(
"UPDATE crawl_jobs SET status = 'pending', total_pages = 0, total_links = 0, " .
"started_at = NULL, completed_at = NULL WHERE id = ?"
);
$stmt->execute([$jobId]);
// Start crawling in background
$cmd = "php " . __DIR__ . "/crawler-worker.php $jobId > /dev/null 2>&1 &";
exec($cmd);
echo json_encode([
'success' => true,
'job_id' => $jobId,
'message' => 'Recrawl started'
]);
break;
default:
throw new Exception('Invalid action');
}

src/classes/Config.php (new file)

@@ -0,0 +1,29 @@
<?php
/**
* Web Crawler - Configuration Class
*
* @copyright Copyright (c) 2025 Martin Kiesewetter
* @author Martin Kiesewetter <mki@kies-media.de>
* @link https://kies-media.de
*/
namespace App;
class Config
{
/**
* Maximum number of redirects before warning
*/
public const int MAX_REDIRECT_THRESHOLD = 3;
/**
* Maximum crawl depth
*/
public const int MAX_CRAWL_DEPTH = 50;
/**
* Number of parallel requests
*/
public const int CONCURRENCY = 10;
}

src/classes/Crawler.php

@@ -1,5 +1,13 @@
<?php
/**
* Web Crawler - Crawler Class
*
* @copyright Copyright (c) 2025 Martin Kiesewetter
* @author Martin Kiesewetter <mki@kies-media.de>
* @link https://kies-media.de
*/
namespace App;
use GuzzleHttp\Client;
@@ -8,28 +16,37 @@ use GuzzleHttp\Psr7\Request;
use GuzzleHttp\Exception\RequestException;
use Symfony\Component\DomCrawler\Crawler as DomCrawler;
class Crawler {
class Crawler
{
private \PDO $db;
private Client $client;
private int $concurrency = 10; // Parallel requests
/** @var array<string, bool> */
private array $visited = [];
private int $crawlJobId;
private string $baseDomain;
public function __construct(int $crawlJobId) {
public function __construct(int $crawlJobId)
{
$this->db = Database::getInstance();
$this->crawlJobId = $crawlJobId;
$this->client = new Client([
'timeout' => 30,
'verify' => false,
'allow_redirects' => [
'max' => 10,
'track_redirects' => true
],
'headers' => [
'User-Agent' => 'WebCrawler/1.0'
]
]);
}
public function start(string $startUrl): void {
$this->baseDomain = strtolower(parse_url($startUrl, PHP_URL_HOST));
public function start(string $startUrl): void
{
$host = parse_url($startUrl, PHP_URL_HOST);
$this->baseDomain = strtolower($host ?: '');
// Update job status
$stmt = $this->db->prepare("UPDATE crawl_jobs SET status = 'running', started_at = NOW() WHERE id = ?");
@@ -48,7 +65,8 @@ class Crawler {
$stmt->execute([$this->crawlJobId]);
}
private function addToQueue(string $url, int $depth): void {
private function addToQueue(string $url, int $depth): void
{
if (isset($this->visited[$url])) {
return;
}
@@ -63,7 +81,8 @@ class Crawler {
}
}
private function processQueue(): void {
private function processQueue(): void
{
while (true) {
// Get pending URLs
$stmt = $this->db->prepare(
@@ -82,14 +101,18 @@ class Crawler {
}
}
private function crawlBatch(array $urls): void {
$requests = function() use ($urls) {
/**
* @param array<int, array{id: int, url: string, depth: int}> $urls
*/
private function crawlBatch(array $urls): void
{
$requests = function () use ($urls) {
foreach ($urls as $item) {
// Mark as processing
$stmt = $this->db->prepare("UPDATE crawl_queue SET status = 'processing' WHERE id = ?");
$stmt->execute([$item['id']]);
yield function() use ($item) {
yield function () use ($item) {
return $this->client->getAsync($item['url']);
};
}
@@ -110,7 +133,12 @@ class Crawler {
$pool->promise()->wait();
}
private function handleResponse(array $queueItem, $response): void {
/**
* @param array{id: int, url: string, depth: int} $queueItem
* @param \Psr\Http\Message\ResponseInterface $response
*/
private function handleResponse(array $queueItem, $response): void
{
$url = $queueItem['url'];
$depth = $queueItem['depth'];
@@ -120,30 +148,61 @@ class Crawler {
$contentType = $response->getHeaderLine('Content-Type');
$body = $response->getBody()->getContents();
// Track redirects
$redirectUrl = null;
$redirectCount = 0;
if ($response->hasHeader('X-Guzzle-Redirect-History')) {
$redirectHistory = $response->getHeader('X-Guzzle-Redirect-History');
$redirectCount = count($redirectHistory);
if ($redirectCount > 0) {
$redirectUrl = end($redirectHistory);
}
}
// Save page
$domCrawler = new DomCrawler($body, $url);
$title = $domCrawler->filter('title')->count() > 0
? $domCrawler->filter('title')->text()
: '';
$metaDescription = $domCrawler->filter('meta[name="description"]')->count() > 0
? $domCrawler->filter('meta[name="description"]')->attr('content')
: '';
$stmt = $this->db->prepare(
"INSERT INTO pages (crawl_job_id, url, title, status_code, content_type)
VALUES (?, ?, ?, ?, ?)
ON DUPLICATE KEY UPDATE id=LAST_INSERT_ID(id), status_code = VALUES(status_code)"
"INSERT INTO pages (crawl_job_id, url, title, meta_description, status_code, " .
"content_type, redirect_url, redirect_count) " .
"VALUES (?, ?, ?, ?, ?, ?, ?, ?) " .
"ON DUPLICATE KEY UPDATE id=LAST_INSERT_ID(id), status_code = VALUES(status_code), " .
"meta_description = VALUES(meta_description), redirect_url = VALUES(redirect_url), " .
"redirect_count = VALUES(redirect_count)"
);
$stmt->execute([$this->crawlJobId, $url, $title, $statusCode, $contentType]);
$stmt->execute([
$this->crawlJobId,
$url,
$title,
$metaDescription,
$statusCode,
$contentType,
$redirectUrl,
$redirectCount
]);
$pageId = $this->db->lastInsertId();
// If pageId is 0, fetch it manually
if ($pageId == 0) {
if ($pageId == 0 || $pageId === '0') {
$stmt = $this->db->prepare("SELECT id FROM pages WHERE crawl_job_id = ? AND url = ?");
$stmt->execute([$this->crawlJobId, $url]);
$pageId = $stmt->fetchColumn();
$fetchedId = $stmt->fetchColumn();
$pageId = is_numeric($fetchedId) ? (int)$fetchedId : 0;
}
// Ensure pageId is an integer
$pageId = is_numeric($pageId) ? (int)$pageId : 0;
// Extract and save links
if (str_contains($contentType, 'text/html')) {
if (str_contains($contentType, 'text/html') && $pageId > 0) {
echo "Extracting links from: $url (pageId: $pageId)\n";
$this->extractLinks($domCrawler, $url, $pageId, $depth);
} else {
@@ -155,7 +214,8 @@ class Crawler {
$stmt->execute([$queueItem['id']]);
}
private function extractLinks(DomCrawler $crawler, string $sourceUrl, int $pageId, int $depth): void {
private function extractLinks(DomCrawler $crawler, string $sourceUrl, int $pageId, int $depth): void
{
$linkCount = 0;
$crawler->filter('a')->each(function (DomCrawler $node) use ($sourceUrl, $pageId, $depth, &$linkCount) {
try {
@@ -176,13 +236,14 @@ class Crawler {
$isNofollow = str_contains($rel, 'nofollow');
// Check if internal (same domain, no subdomains)
$targetDomain = strtolower(parse_url($targetUrl, PHP_URL_HOST) ?? '');
$targetHost = parse_url($targetUrl, PHP_URL_HOST);
$targetDomain = strtolower($targetHost ?: '');
$isInternal = ($targetDomain === $this->baseDomain);
// Save link
$stmt = $this->db->prepare(
"INSERT INTO links (page_id, crawl_job_id, source_url, target_url, link_text, is_nofollow, is_internal)
VALUES (?, ?, ?, ?, ?, ?, ?)"
"INSERT INTO links (page_id, crawl_job_id, source_url, target_url, " .
"link_text, is_nofollow, is_internal) VALUES (?, ?, ?, ?, ?, ?, ?)"
);
$stmt->execute([
$pageId,
@@ -207,7 +268,8 @@ class Crawler {
echo "Processed $linkCount links from $sourceUrl\n";
}
private function makeAbsoluteUrl(string $url, string $base): string {
private function makeAbsoluteUrl(string $url, string $base): string
{
if (filter_var($url, FILTER_VALIDATE_URL)) {
return $url;
}
@@ -225,14 +287,20 @@ class Crawler {
return "$scheme://$host$basePath$url";
}
private function handleError(array $queueItem, $reason): void {
/**
* @param array{id: int, url: string, depth: int} $queueItem
* @param \GuzzleHttp\Exception\RequestException $reason
*/
private function handleError(array $queueItem, $reason): void
{
$stmt = $this->db->prepare(
"UPDATE crawl_queue SET status = 'failed', processed_at = NOW(), retry_count = retry_count + 1 WHERE id = ?"
);
$stmt->execute([$queueItem['id']]);
}
private function updateJobStats(): void {
private function updateJobStats(): void
{
$stmt = $this->db->prepare(
"UPDATE crawl_jobs SET
total_pages = (SELECT COUNT(*) FROM pages WHERE crawl_job_id = ?),
@@ -242,7 +310,8 @@ class Crawler {
$stmt->execute([$this->crawlJobId, $this->crawlJobId, $this->crawlJobId]);
}
private function normalizeUrl(string $url): string {
private function normalizeUrl(string $url): string
{
// Parse URL
$parts = parse_url($url);

src/classes/Database.php

@@ -1,16 +1,28 @@
<?php
/**
* Web Crawler - Database Class
*
* @copyright Copyright (c) 2025 Martin Kiesewetter
* @author Martin Kiesewetter <mki@kies-media.de>
* @link https://kies-media.de
*/
namespace App;
use PDO;
use PDOException;
class Database {
class Database
{
private static ?PDO $instance = null;
private function __construct() {}
private function __construct()
{
}
public static function getInstance(): PDO {
public static function getInstance(): PDO
{
if (self::$instance === null) {
try {
self::$instance = new PDO(

src/composer.json

@@ -1,4 +1,5 @@
{
"_comment": "Web Crawler - Composer Configuration | Copyright (c) 2025 Martin Kiesewetter <mki@kies-media.de> | https://kies-media.de",
"name": "web-crawler/app",
"description": "Web Crawler Application with Parallel Processing",
"type": "project",
@@ -9,7 +10,9 @@
"symfony/css-selector": "^7.0"
},
"require-dev": {
"phpunit/phpunit": "^11.0"
"phpunit/phpunit": "^11.0",
"phpstan/phpstan": "^2.1",
"squizlabs/php_codesniffer": "^4.0"
},
"autoload": {
"psr-4": {
@@ -22,6 +25,9 @@
}
},
"scripts": {
"test": "phpunit"
"test": "phpunit",
"phpstan": "phpstan analyse -c ../phpstan.neon --memory-limit=512M",
"phpcs": "phpcs --standard=PSR12 --ignore=/var/www/html/vendor /var/www/html /var/www/tests",
"phpcbf": "phpcbf --standard=PSR12 --ignore=/var/www/html/vendor /var/www/html /var/www/tests"
}
}

src/composer.lock (generated)

@@ -4,7 +4,7 @@
"Read more about it at https://getcomposer.org/doc/01-basic-usage.md#installing-dependencies",
"This file is @generated automatically"
],
"content-hash": "96376d6cdbd0e0665e091abe3e0ef8d8",
"content-hash": "bb0d5fc291c18a44bfc693b94b302357",
"packages": [
{
"name": "guzzlehttp/guzzle",
@@ -1211,6 +1211,59 @@
},
"time": "2022-02-21T01:04:05+00:00"
},
{
"name": "phpstan/phpstan",
"version": "2.1.30",
"dist": {
"type": "zip",
"url": "https://api.github.com/repos/phpstan/phpstan/zipball/a4a7f159927983dd4f7c8020ed227d80b7f39d7d",
"reference": "a4a7f159927983dd4f7c8020ed227d80b7f39d7d",
"shasum": ""
},
"require": {
"php": "^7.4|^8.0"
},
"conflict": {
"phpstan/phpstan-shim": "*"
},
"bin": [
"phpstan",
"phpstan.phar"
],
"type": "library",
"autoload": {
"files": [
"bootstrap.php"
]
},
"notification-url": "https://packagist.org/downloads/",
"license": [
"MIT"
],
"description": "PHPStan - PHP Static Analysis Tool",
"keywords": [
"dev",
"static analysis"
],
"support": {
"docs": "https://phpstan.org/user-guide/getting-started",
"forum": "https://github.com/phpstan/phpstan/discussions",
"issues": "https://github.com/phpstan/phpstan/issues",
"security": "https://github.com/phpstan/phpstan/security/policy",
"source": "https://github.com/phpstan/phpstan-src"
},
"funding": [
{
"url": "https://github.com/ondrejmirtes",
"type": "github"
},
{
"url": "https://github.com/phpstan",
"type": "github"
}
],
"time": "2025-10-02T16:07:52+00:00"
},
{
"name": "phpunit/php-code-coverage",
"version": "11.0.11",
@@ -2641,6 +2694,85 @@
],
"time": "2024-10-09T05:16:32+00:00"
},
{
"name": "squizlabs/php_codesniffer",
"version": "4.0.0",
"source": {
"type": "git",
"url": "https://github.com/PHPCSStandards/PHP_CodeSniffer.git",
"reference": "06113cfdaf117fc2165f9cd040bd0f17fcd5242d"
},
"dist": {
"type": "zip",
"url": "https://api.github.com/repos/PHPCSStandards/PHP_CodeSniffer/zipball/06113cfdaf117fc2165f9cd040bd0f17fcd5242d",
"reference": "06113cfdaf117fc2165f9cd040bd0f17fcd5242d",
"shasum": ""
},
"require": {
"ext-simplexml": "*",
"ext-tokenizer": "*",
"ext-xmlwriter": "*",
"php": ">=7.2.0"
},
"require-dev": {
"phpunit/phpunit": "^8.4.0 || ^9.3.4 || ^10.5.32 || 11.3.3 - 11.5.28 || ^11.5.31"
},
"bin": [
"bin/phpcbf",
"bin/phpcs"
],
"type": "library",
"notification-url": "https://packagist.org/downloads/",
"license": [
"BSD-3-Clause"
],
"authors": [
{
"name": "Greg Sherwood",
"role": "Former lead"
},
{
"name": "Juliette Reinders Folmer",
"role": "Current lead"
},
{
"name": "Contributors",
"homepage": "https://github.com/PHPCSStandards/PHP_CodeSniffer/graphs/contributors"
}
],
"description": "PHP_CodeSniffer tokenizes PHP files and detects violations of a defined set of coding standards.",
"homepage": "https://github.com/PHPCSStandards/PHP_CodeSniffer",
"keywords": [
"phpcs",
"standards",
"static analysis"
],
"support": {
"issues": "https://github.com/PHPCSStandards/PHP_CodeSniffer/issues",
"security": "https://github.com/PHPCSStandards/PHP_CodeSniffer/security/policy",
"source": "https://github.com/PHPCSStandards/PHP_CodeSniffer",
"wiki": "https://github.com/PHPCSStandards/PHP_CodeSniffer/wiki"
},
"funding": [
{
"url": "https://github.com/PHPCSStandards",
"type": "github"
},
{
"url": "https://github.com/jrfnl",
"type": "github"
},
{
"url": "https://opencollective.com/php_codesniffer",
"type": "open_collective"
},
{
"url": "https://thanks.dev/u/gh/phpcsstandards",
"type": "thanks_dev"
}
],
"time": "2025-09-15T11:28:58+00:00"
},
{
"name": "staabm/side-effects-detector",
"version": "1.0.5",

src/crawler-worker.php

@@ -1,6 +1,14 @@
#!/usr/bin/env php
<?php
/**
* Web Crawler - Background Worker
*
* @copyright Copyright (c) 2025 Martin Kiesewetter
* @author Martin Kiesewetter <mki@kies-media.de>
* @link https://kies-media.de
*/
require_once __DIR__ . '/vendor/autoload.php';
use App\Database;

src/index.php

@@ -1,9 +1,28 @@
<!DOCTYPE html>
<!--
/**
* Web Crawler - Main Interface
*
* @copyright Copyright (c) 2025 Martin Kiesewetter
* @author Martin Kiesewetter <mki@kies-media.de>
* @link https://kies-media.de
*/
-->
<html lang="de">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Web Crawler</title>
<!-- jQuery -->
<script src="https://code.jquery.com/jquery-3.7.1.min.js"></script>
<!-- DataTables CSS -->
<link rel="stylesheet" href="https://cdn.datatables.net/1.13.7/css/jquery.dataTables.min.css">
<!-- DataTables JS -->
<script src="https://cdn.datatables.net/1.13.7/js/jquery.dataTables.min.js"></script>
<style>
* {
margin: 0;
@@ -198,6 +217,58 @@
text-overflow: ellipsis;
white-space: nowrap;
}
/* DataTables Styling */
.dataTables_wrapper {
padding: 20px 0;
}
.dataTables_filter input {
padding: 8px;
border: 2px solid #e0e0e0;
border-radius: 6px;
margin-left: 10px;
}
.dataTables_length select {
padding: 6px;
border: 2px solid #e0e0e0;
border-radius: 6px;
margin: 0 10px;
}
.dataTables_info {
padding-top: 10px;
color: #7f8c8d;
}
.dataTables_paginate {
padding-top: 10px;
}
.dataTables_paginate .paginate_button {
padding: 6px 12px;
margin: 0 2px;
border: 1px solid #e0e0e0;
border-radius: 4px;
background: white;
cursor: pointer;
}
.dataTables_paginate .paginate_button.current {
background: #3498db;
color: white !important;
border-color: #3498db;
}
.dataTables_paginate .paginate_button:hover {
background: #ecf0f1;
}
.dataTables_paginate .paginate_button.disabled {
cursor: not-allowed;
opacity: 0.5;
}
</style>
</head>
<body>
@@ -214,7 +285,7 @@
<div class="card">
<h2>Crawl Jobs</h2>
<table id="jobsTable">
<table id="jobsTable" class="display">
<thead>
<tr>
<th>ID</th>
@@ -241,10 +312,13 @@
<div class="tabs">
<button class="tab active" onclick="switchTab('pages')">Seiten</button>
<button class="tab" onclick="switchTab('links')">Links</button>
<button class="tab" onclick="switchTab('broken')">Broken Links</button>
<button class="tab" onclick="switchTab('redirects')">Redirects</button>
<button class="tab" onclick="switchTab('seo')">SEO Analysis</button>
</div>
<div class="tab-content active" id="pages-tab">
<table>
<table id="pagesTable" class="display">
<thead>
<tr>
<th>URL</th>
@@ -260,7 +334,7 @@
</div>
<div class="tab-content" id="links-tab">
<table>
<table id="linksTable" class="display">
<thead>
<tr>
<th>Von</th>
@@ -275,6 +349,62 @@
</tbody>
</table>
</div>
<div class="tab-content" id="broken-tab">
<table id="brokenTable" class="display">
<thead>
<tr>
<th>URL</th>
<th>Status Code</th>
<th>Titel</th>
<th>Gecrawlt</th>
</tr>
</thead>
<tbody id="brokenBody">
<tr><td colspan="4" class="loading">Keine defekten Links gefunden</td></tr>
</tbody>
</table>
</div>
<div class="tab-content" id="redirects-tab">
<h3>Redirect Statistics</h3>
<div id="redirectStats" class="stats" style="margin-bottom: 20px;"></div>
<table id="redirectsTable" class="display">
<thead>
<tr>
<th>URL</th>
<th>Redirect To</th>
<th>Status Code</th>
<th>Redirect Count</th>
<th>Type</th>
</tr>
</thead>
<tbody id="redirectsBody">
<tr><td colspan="5" class="loading">Keine Redirects gefunden</td></tr>
</tbody>
</table>
</div>
<div class="tab-content" id="seo-tab">
<h3>SEO Issues</h3>
<div id="seoStats" style="margin-bottom: 20px;"></div>
<table id="seoTable" class="display">
<thead>
<tr>
<th>URL</th>
<th>Title (Länge)</th>
<th>Meta Description (Länge)</th>
<th>Issues</th>
</tr>
</thead>
<tbody id="seoIssuesBody">
<tr><td colspan="4" class="loading">Keine SEO-Probleme gefunden</td></tr>
</tbody>
</table>
<h3 style="margin-top: 30px;">Duplicate Content</h3>
<div id="seoDuplicatesBody"></div>
</div>
</div>
</div>
</div>
@@ -312,12 +442,19 @@
}
}
let jobsDataTable = null;
async function loadJobs() {
try {
const response = await fetch('/api.php?action=jobs');
const data = await response.json();
if (data.success) {
// Destroy existing DataTable if it exists
if (jobsDataTable) {
jobsDataTable.destroy();
}
const tbody = document.getElementById('jobsBody');
tbody.innerHTML = data.jobs.map(job => `
<tr>
@@ -329,10 +466,30 @@
<td>${job.started_at || '-'}</td>
<td>
<button class="action-btn" onclick="viewJob(${job.id})">Ansehen</button>
<button class="action-btn" onclick="recrawlJob(${job.id}, '${job.domain}')">Recrawl</button>
<button class="action-btn" onclick="deleteJob(${job.id})">Löschen</button>
</td>
</tr>
`).join('');
// Initialize DataTable
jobsDataTable = $('#jobsTable').DataTable({
pageLength: 25,
order: [[0, 'desc']],
language: {
search: 'Suchen:',
lengthMenu: 'Zeige _MENU_ Einträge',
info: 'Zeige _START_ bis _END_ von _TOTAL_ Einträgen',
infoEmpty: 'Keine Einträge verfügbar',
infoFiltered: '(gefiltert von _MAX_ Einträgen)',
paginate: {
first: 'Erste',
last: 'Letzte',
next: 'Nächste',
previous: 'Vorherige'
}
}
});
}
} catch (e) {
console.error('Fehler beim Laden der Jobs:', e);
@@ -404,6 +561,10 @@
const pagesResponse = await fetch(`/api.php?action=pages&job_id=${currentJobId}`);
const pagesData = await pagesResponse.json();
if ($.fn.DataTable.isDataTable('#pagesTable')) {
$('#pagesTable').DataTable().destroy();
}
if (pagesData.success && pagesData.pages.length > 0) {
document.getElementById('pagesBody').innerHTML = pagesData.pages.map(page => `
<tr>
@@ -413,12 +574,33 @@
<td>${page.crawled_at}</td>
</tr>
`).join('');
$('#pagesTable').DataTable({
pageLength: 50,
language: {
search: 'Suchen:',
lengthMenu: 'Zeige _MENU_ Einträge',
info: 'Zeige _START_ bis _END_ von _TOTAL_ Einträgen',
infoEmpty: 'Keine Einträge verfügbar',
infoFiltered: '(gefiltert von _MAX_ Einträgen)',
paginate: {
first: 'Erste',
last: 'Letzte',
next: 'Nächste',
previous: 'Vorherige'
}
}
});
}
// Load links
const linksResponse = await fetch(`/api.php?action=links&job_id=${currentJobId}`);
const linksData = await linksResponse.json();
if ($.fn.DataTable.isDataTable('#linksTable')) {
$('#linksTable').DataTable().destroy();
}
if (linksData.success && linksData.links.length > 0) {
document.getElementById('linksBody').innerHTML = linksData.links.map(link => `
<tr>
@@ -429,6 +611,205 @@
<td>${link.is_internal ? 'Intern' : '<span class="external">Extern</span>'}</td>
</tr>
`).join('');
$('#linksTable').DataTable({
pageLength: 50,
language: {
search: 'Suchen:',
lengthMenu: 'Zeige _MENU_ Einträge',
info: 'Zeige _START_ bis _END_ von _TOTAL_ Einträgen',
infoEmpty: 'Keine Einträge verfügbar',
infoFiltered: '(gefiltert von _MAX_ Einträgen)',
paginate: {
first: 'Erste',
last: 'Letzte',
next: 'Nächste',
previous: 'Vorherige'
}
}
});
}
// Load broken links
const brokenResponse = await fetch(`/api.php?action=broken-links&job_id=${currentJobId}`);
const brokenData = await brokenResponse.json();
if ($.fn.DataTable.isDataTable('#brokenTable')) {
$('#brokenTable').DataTable().destroy();
}
if (brokenData.success && brokenData.broken_links.length > 0) {
document.getElementById('brokenBody').innerHTML = brokenData.broken_links.map(page => `
<tr>
<td class="url-cell" title="${page.url}">${page.url}</td>
<td><span class="status failed">${page.status_code || 'Error'}</span></td>
<td>${page.title || '-'}</td>
<td>${page.crawled_at}</td>
</tr>
`).join('');
$('#brokenTable').DataTable({
pageLength: 25,
language: {
search: 'Suchen:',
lengthMenu: 'Zeige _MENU_ Einträge',
info: 'Zeige _START_ bis _END_ von _TOTAL_ Einträgen',
infoEmpty: 'Keine Einträge verfügbar',
infoFiltered: '(gefiltert von _MAX_ Einträgen)',
paginate: {
first: 'Erste',
last: 'Letzte',
next: 'Nächste',
previous: 'Vorherige'
}
}
});
} else {
document.getElementById('brokenBody').innerHTML = '<tr><td colspan="4" class="loading">Keine defekten Links gefunden</td></tr>';
}
// Load SEO analysis
const seoResponse = await fetch(`/api.php?action=seo-analysis&job_id=${currentJobId}`);
const seoData = await seoResponse.json();
if (seoData.success) {
// SEO Stats
document.getElementById('seoStats').innerHTML = `
<div class="stat-box">
<div class="stat-label">Total Pages</div>
<div class="stat-value">${seoData.total_pages}</div>
</div>
<div class="stat-box">
<div class="stat-label">Pages with Issues</div>
<div class="stat-value">${seoData.issues.length}</div>
</div>
<div class="stat-box">
<div class="stat-label">Duplicates Found</div>
<div class="stat-value">${seoData.duplicates.length}</div>
</div>
`;
// SEO Issues
if ($.fn.DataTable.isDataTable('#seoTable')) {
$('#seoTable').DataTable().destroy();
}
if (seoData.issues.length > 0) {
document.getElementById('seoIssuesBody').innerHTML = seoData.issues.map(item => `
<tr>
<td class="url-cell" title="${item.url}">${item.url}</td>
<td>${item.title || '-'} (${item.title_length})</td>
<td>${item.meta_description ? item.meta_description.substring(0, 50) + '...' : '-'} (${item.meta_length})</td>
<td><span class="nofollow">${item.issues.join(', ')}</span></td>
</tr>
`).join('');
$('#seoTable').DataTable({
pageLength: 25,
language: {
search: 'Suchen:',
lengthMenu: 'Zeige _MENU_ Einträge',
info: 'Zeige _START_ bis _END_ von _TOTAL_ Einträgen',
infoEmpty: 'Keine Einträge verfügbar',
infoFiltered: '(gefiltert von _MAX_ Einträgen)',
paginate: {
first: 'Erste',
last: 'Letzte',
next: 'Nächste',
previous: 'Vorherige'
}
}
});
} else {
document.getElementById('seoIssuesBody').innerHTML = '<tr><td colspan="4" class="loading">Keine SEO-Probleme gefunden</td></tr>';
}
// Duplicates
if (seoData.duplicates.length > 0) {
document.getElementById('seoDuplicatesBody').innerHTML = seoData.duplicates.map(dup => `
<div class="stat-box" style="margin-bottom: 15px;">
<div class="stat-label">Duplicate ${dup.type}</div>
<div style="font-size: 14px; margin: 10px 0;"><strong>${dup.content}</strong></div>
<div style="font-size: 12px;">Found on ${dup.urls.length} pages:</div>
<ul style="margin-top: 5px; font-size: 12px;">
${dup.urls.map(url => `<li>${url}</li>`).join('')}
</ul>
</div>
`).join('');
} else {
document.getElementById('seoDuplicatesBody').innerHTML = '<p>Keine doppelten Inhalte gefunden</p>';
}
}
// Load redirects
const redirectsResponse = await fetch(`/api.php?action=redirects&job_id=${currentJobId}`);
const redirectsData = await redirectsResponse.json();
if (redirectsData.success) {
const stats = redirectsData.stats;
// Redirect Stats
document.getElementById('redirectStats').innerHTML = `
<div class="stat-box">
<div class="stat-label">Total Redirects</div>
<div class="stat-value">${stats.total}</div>
</div>
<div class="stat-box">
<div class="stat-label">Permanent (301/308)</div>
<div class="stat-value">${stats.permanent}</div>
</div>
<div class="stat-box">
<div class="stat-label">Temporary (302/303/307)</div>
<div class="stat-value">${stats.temporary}</div>
</div>
<div class="stat-box">
<div class="stat-label">Excessive (>${stats.threshold})</div>
<div class="stat-value" style="color: ${stats.excessive > 0 ? '#e74c3c' : '#27ae60'}">${stats.excessive}</div>
<div class="stat-sublabel">threshold: ${stats.threshold}</div>
</div>
`;
// Redirect Table
if ($.fn.DataTable.isDataTable('#redirectsTable')) {
$('#redirectsTable').DataTable().destroy();
}
if (redirectsData.redirects.length > 0) {
document.getElementById('redirectsBody').innerHTML = redirectsData.redirects.map(redirect => {
const isExcessive = redirect.redirect_count > stats.threshold;
const isPermRedirect = redirect.status_code == 301 || redirect.status_code == 308;
const redirectType = isPermRedirect ? 'Permanent' : 'Temporary';
return `
<tr style="${isExcessive ? 'background-color: #fff3cd;' : ''}">
<td class="url-cell" title="${redirect.url}">${redirect.url}</td>
<td class="url-cell" title="${redirect.redirect_url || '-'}">${redirect.redirect_url || '-'}</td>
<td><span class="status ${isPermRedirect ? 'completed' : 'running'}">${redirect.status_code}</span></td>
<td><strong ${isExcessive ? 'style="color: #e74c3c;"' : ''}>${redirect.redirect_count}</strong></td>
<td>${redirectType}</td>
</tr>
`;
}).join('');
$('#redirectsTable').DataTable({
pageLength: 25,
language: {
search: 'Suchen:',
lengthMenu: 'Zeige _MENU_ Einträge',
info: 'Zeige _START_ bis _END_ von _TOTAL_ Einträgen',
infoEmpty: 'Keine Einträge verfügbar',
infoFiltered: '(gefiltert von _MAX_ Einträgen)',
paginate: {
first: 'Erste',
last: 'Letzte',
next: 'Nächste',
previous: 'Vorherige'
}
}
});
} else {
document.getElementById('redirectsBody').innerHTML = '<tr><td colspan="5" class="loading">Keine Redirects gefunden</td></tr>';
}
}
// Update jobs table
@@ -463,6 +844,31 @@
}
}
async function recrawlJob(jobId, domain) {
if (!confirm('Job-Ergebnisse löschen und neu crawlen?')) return;
const formData = new FormData();
formData.append('job_id', jobId);
formData.append('domain', domain);
try {
const response = await fetch('/api.php?action=recrawl', {
method: 'POST',
body: formData
});
const data = await response.json();
if (data.success) {
loadJobs();
alert('Recrawl gestartet! Job ID: ' + data.job_id);
} else {
alert('Fehler: ' + data.error);
}
} catch (e) {
alert('Fehler beim Recrawl: ' + e.message);
}
}
function switchTab(tab) {
document.querySelectorAll('.tab').forEach(t => t.classList.remove('active'));
document.querySelectorAll('.tab-content').forEach(c => c.classList.remove('active'));

tests/Integration/CrawlerIntegrationTest.php

@@ -18,7 +18,8 @@ class CrawlerIntegrationTest extends TestCase
// Create a test job
$stmt = $this->db->prepare("INSERT INTO crawl_jobs (domain, status) VALUES (?, 'pending')");
$stmt->execute(['https://httpbin.org']);
$this->testJobId = $this->db->lastInsertId();
$lastId = $this->db->lastInsertId();
$this->testJobId = is_numeric($lastId) ? (int)$lastId : 0;
}
protected function tearDown(): void

tests/Unit/CrawlerTest.php

@@ -17,7 +17,8 @@ class CrawlerTest extends TestCase
// Create a test job
$stmt = $db->prepare("INSERT INTO crawl_jobs (domain, status) VALUES (?, 'pending')");
$stmt->execute(['https://example.com']);
$this->testJobId = $db->lastInsertId();
$lastId = $db->lastInsertId();
$this->testJobId = is_numeric($lastId) ? (int)$lastId : 0;
}
protected function tearDown(): void

tests/Unit/DatabaseTest.php

@@ -42,6 +42,7 @@ class DatabaseTest extends TestCase
{
$db = Database::getInstance();
$stmt = $db->query('SELECT 1 as test');
$this->assertNotFalse($stmt, 'Query failed');
$result = $stmt->fetch();
$this->assertEquals(['test' => 1], $result);

webanalyse.php (deleted)

@@ -1,457 +0,0 @@
<?php
/**
* Klasse uebernimmt das Crawlen von Websites und persistiert Metadaten in MySQL.
*/
class webanalyse
{
/**
* @var mysqli|null Verbindung zur Screaming Frog Datenbank.
*/
var $db;
/**
* Initialisiert die Datenbankverbindung fuer die Crawl-Session.
*/
function __construct()
{
$this->db = mysqli_connect("localhost", "root", "", "screaming_frog");
}
/**
* Holt eine einzelne URL via cURL und liefert Response-Metadaten.
*
* @param string $url Zieladresse fuer den Abruf.
* @return array<string,mixed> Antwortdaten oder ein "error"-Schluessel.
*/
function getWebsite($url)
{
// cURL-Session initialisieren
$ch = curl_init();
// cURL-Optionen setzen
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // Antwort als String zurückgeben
curl_setopt($ch, CURLOPT_HEADER, true); // Header in der Antwort einschließen
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // Weiterleitungen folgen
curl_setopt($ch, CURLOPT_TIMEOUT, 30); // Timeout nach 30 Sekunden
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'); // User Agent setzen
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); // SSL-Zertifikat nicht prüfen (nur für Tests)
// Anfrage ausführen
$response = curl_exec($ch);
// Fehler überprüfen
if (curl_errno($ch)) {
$error = curl_error($ch);
curl_close($ch);
return ['error' => $error];
}
// Informationen abrufen
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
$headerSize = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
$totalTime = curl_getinfo($ch, CURLINFO_TOTAL_TIME);
$effectiveUrl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
// cURL-Session schließen
curl_close($ch);
// Header und Body trennen
$headers = substr($response, 0, $headerSize);
$body = substr($response, $headerSize);
// Header in Array umwandeln
$headerLines = explode("\r\n", trim($headers));
$parsedHeaders = [];
foreach ($headerLines as $line) {
if (strpos($line, ':') !== false) {
list($key, $value) = explode(':', $line, 2);
$parsedHeaders[trim($key)] = trim($value);
}
}
return [
'url' => $effectiveUrl,
'status_code' => $httpCode,
// 'headers_raw' => $headers,
'headers_parsed' => $parsedHeaders,
'body' => $body,
'response_time' => $totalTime,
'body_size' => strlen($body)
];
}
/**
* Ruft mehrere URLs parallel via curl_multi ab.
*
* @param array<int,string> $urls Liste von Ziel-URLs.
* @return array<string,array<string,mixed>> Antworten je URL.
*/
function getMultipleWebsites($urls)
{
$results = [];
$curlHandles = [];
$multiHandle = curl_multi_init();
// Einzelne cURL-Handles für jede URL erstellen
foreach ($urls as $url) {
$ch = curl_init();
// cURL-Optionen setzen (gleich wie bei getWebsite)
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36');
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
// Handle zum Multi-Handle hinzufügen
curl_multi_add_handle($multiHandle, $ch);
$curlHandles[$url] = $ch;
}
// Alle Anfragen parallel ausführen
$running = null;
do {
curl_multi_exec($multiHandle, $running);
curl_multi_select($multiHandle);
} while ($running > 0);
// Ergebnisse verarbeiten
foreach ($urls as $url) {
$ch = $curlHandles[$url];
$response = curl_multi_getcontent($ch);
// Fehler überprüfen
if (curl_errno($ch)) {
$error = curl_error($ch);
$results[$url] = ['error' => $error];
} else {
// Informationen abrufen
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
$headerSize = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
$totalTime = curl_getinfo($ch, CURLINFO_TOTAL_TIME);
$effectiveUrl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
// Header und Body trennen
$headers = substr($response, 0, $headerSize);
$body = substr($response, $headerSize);
// Header in Array umwandeln
$headerLines = explode("\r\n", trim($headers));
$parsedHeaders = [];
foreach ($headerLines as $line) {
if (strpos($line, ':') !== false) {
list($key, $value) = explode(':', $line, 2);
$parsedHeaders[trim($key)] = trim($value);
}
}
$results[$url] = [
'url' => $effectiveUrl,
'status_code' => $httpCode,
'headers_parsed' => $parsedHeaders,
'body' => $body,
'response_time' => $totalTime,
'body_size' => strlen($body)
];
}
// Handle aus Multi-Handle entfernen und schließen
curl_multi_remove_handle($multiHandle, $ch);
curl_close($ch);
}
// Multi-Handle schließen
curl_multi_close($multiHandle);
return $results;
}
/**
* Persistiert Response-Daten und stoesst die Analyse der gefundenen Links an.
*
* @param int $crawlID Identifier der Crawl-Session.
* @param string $url Ursprung-URL, deren Antwort verarbeitet wird.
* @param array<string,mixed> $data Ergebnis der HTTP-Abfrage.
* @return void
*/
function processResults(int $crawlID, string $url, array $data)
{
if (!isset($data['error'])) {
$status_code = $data['status_code'];
$response_time = $data['response_time'];
$body_size = $data['body_size'];
$date = date('Y-m-d H:i:s');
$body = $data['body'];
$sql = "UPDATE urls SET
status_code = " . $status_code . ",
response_time = " . ($response_time * 1000) . ",
body_size = " . $body_size . ",
date = now(),
body = '" . $this->db->real_escape_string($body) . "'
WHERE url = '" . $this->db->real_escape_string($url) . "' AND crawl_id = " . $crawlID . " LIMIT 1";
// echo $sql;
$this->db->query($sql);
} else {
// Handle error case if needed
echo "Fehler bei der Analyse von $url: " . $data['error'] . "\n";
}
$this->findNewUrls($crawlID, $body, $url);
}
/**
* Extrahiert Links aus einer Antwort und legt neue URL-Datensaetze an.
*
* @param int $crawlID Identifier der Crawl-Session.
* @param string $body HTML-Koerper der Antwort.
* @param string $url Bearbeitete URL, dient als Kontext fuer relative Links.
* @return void
*/
function findNewUrls(int $crawlID, string $body, string $url) {
$links = $this->extractLinks($body, $url);
$temp = $this->db->query("select id from urls where url = '".$this->db->real_escape_string($url)."' and crawl_id = ".$crawlID." LIMIT 1")->fetch_all(MYSQLI_ASSOC);
$vonUrlId = $temp[0]['id'];
$this->db->query("delete from links where von = ".$vonUrlId);
foreach($links as $l) {
$u = $this->db->query("insert ignore into urls (url, crawl_id) values ('".$this->db->real_escape_string($l['absolute_url'])."',".$crawlID.")");
$id = $this->db->insert_id;
if ($id === 0) {
$qwer = $this->db->query("select id from urls where url = '".$this->db->real_escape_string($l['absolute_url'])."' and crawl_id = ".$crawlID." LIMIT 1")->fetch_all(MYSQLI_ASSOC);
$id = $qwer[0]['id'];
}
$sql_links = "insert ignore into links (von, nach, linktext, dofollow) values (
".$vonUrlId.",
".$id.",
'".$this->db->real_escape_string(mb_convert_encoding($l['text'],"UTF-8"))."',
".(strstr($l['rel']??"", 'nofollow') === false ? 1 : 0)."
)";
echo $sql_links;
$u = $this->db->query($sql_links);
}
print_r($links);
}
/**
* Startet einen Crawl-Durchlauf fuer unbehandelte URLs.
*
* @param int $crawlID Identifier der Crawl-Session.
* @return void
*/
function doCrawl(int $crawlID)
{
$urls2toCrawl = $this->db->query("select * from urls where crawl_id = " . $crawlID . " and date is null LIMIT 2")->fetch_all(MYSQLI_ASSOC); // and date is not null
$urls = [];
foreach ($urls2toCrawl as $u) {
$urls[] = $u['url'];
}
$multipleResults = $this->getMultipleWebsites($urls);
// print_r($multipleResults);
foreach ($multipleResults as $url => $data) {
$this->processResults($crawlID, $url, $data);
}
}
/**
* Parst HTML-Inhalt und liefert eine strukturierte Liste gefundener Links.
*
* @param string $html Rohes HTML-Dokument.
* @param string $baseUrl Basis-URL fuer die Aufloesung relativer Pfade.
* @return array<int,array<string,mixed>> Gesammelte Linkdaten.
*/
function extractLinks($html, $baseUrl = '')
{
$links = [];
// DOMDocument erstellen und HTML laden
$dom = new DOMDocument();
// Fehlerbehandlung für ungültiges HTML
libxml_use_internal_errors(true);
$dom->loadHTML('<?xml encoding="UTF-8">' . $html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();
// Alle <a> Tags finden
$aTags = $dom->getElementsByTagName('a');
foreach ($aTags as $index => $aTag) {
$href = $aTag->getAttribute('href');
$text = trim($aTag->textContent);
$rel = $aTag->getAttribute('rel');
$title = $aTag->getAttribute('title');
$target = $aTag->getAttribute('target');
// Nur Links mit href-Attribut
if (!empty($href)) {
// Relative URLs zu absoluten URLs konvertieren
$absoluteUrl = $href;
if (!empty($baseUrl) && !preg_match('/^https?:\/\//', $href)) {
$absoluteUrl = rtrim($baseUrl, '/') . '/' . ltrim($href, '/');
}
$links[] = [
'index' => $index + 1,
'href' => $href,
'absolute_url' => $absoluteUrl,
'text' => $text,
'rel' => $rel ?: null,
'title' => $title ?: null,
'target' => $target ?: null,
'is_external' => $this->isExternalLink($href, $baseUrl),
'link_type' => $this->getLinkType($href),
'is_internal' => $this->isInternalLink($href, $baseUrl)?1:0
];
}
}
return $links;
}
/**
* Prueft, ob ein Link aus Sicht der Basis-URL extern ist.
*
* @param string $href Ziel des Links.
* @param string $baseUrl Ausgangsadresse zur Domainabgleichung.
* @return bool|null True fuer extern, false fuer intern, null falls undefiniert.
*/
private function isExternalLink($href, $baseUrl)
{
if (empty($baseUrl)) return null;
// Relative Links sind intern
if (!preg_match('/^https?:\/\//', $href)) {
return false;
}
$baseDomain = parse_url($baseUrl, PHP_URL_HOST);
$linkDomain = parse_url($href, PHP_URL_HOST);
return $baseDomain !== $linkDomain;
}
/**
* Prueft, ob ein Link derselben Domain wie die Basis-URL entspricht.
*
* @param string $href Ziel des Links.
* @param string $baseUrl Ausgangsadresse zur Domainabgleichung.
* @return bool|null True fuer intern, false fuer extern, null falls undefiniert.
*/
private function isInternalLink($href, $baseUrl)
{
if (empty($baseUrl)) return null;
// Relative Links sind intern
if (!preg_match('/^https?:\/\//', $href)) {
return true;
}
$baseDomain = parse_url($baseUrl, PHP_URL_HOST);
$linkDomain = parse_url($href, PHP_URL_HOST);
return $baseDomain === $linkDomain;
}
/**
* Leitet den Link-Typ anhand gaengiger Protokolle und Muster ab.
*
* @param string $href Ziel des Links.
* @return string Beschreibender Typ wie "absolute" oder "email".
*/
private function getLinkType($href)
{
if (empty($href)) return 'empty';
if (strpos($href, 'mailto:') === 0) return 'email';
if (strpos($href, 'tel:') === 0) return 'phone';
if (strpos($href, '#') === 0) return 'anchor';
if (strpos($href, 'javascript:') === 0) return 'javascript';
if (preg_match('/^https?:\/\//', $href)) return 'absolute';
return 'relative';
}
/**
* Gruppiert Links anhand ihres vorab bestimmten Typs.
*
* @param array<int,array<string,mixed>> $links Liste der extrahierten Links.
* @return array<string,array<int,array<string,mixed>>> Links nach Typ gruppiert.
*/
function groupLinksByType($links)
{
$grouped = [];
foreach ($links as $link) {
$type = $link['link_type'];
if (!isset($grouped[$type])) {
$grouped[$type] = [];
}
$grouped[$type][] = $link;
}
return $grouped;
}
}