Trend Filtering System¶
Overview¶
The MiPörög trend aggregator includes a smart filtering system to exclude low-quality navigational searches that are not suitable for news article generation. This prevents the system from wasting resources on searches like "m4 sport", "facebook bejelentkezés", or "youtube" where users are simply trying to navigate to a website.
How It Works¶
The filtering system uses a hybrid approach with three layers:
1. Exact String Matching (Fast)¶
- Checks against a curated list of 50+ common navigational searches
- Includes TV channels, social media sites, search engines, email providers
- Case-insensitive matching
- Performance: ~0.001ms per check
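The exact-match layer can be pictured as a set lookup on a normalized string. This is a minimal sketch, not the project's code: the set contents are a short stand-in excerpt, and the helper name `is_exact_navigational` is hypothetical.

```python
# Stand-in excerpt of the curated list; the real set lives in src/filters.py.
DOMAIN_SEARCHES = {"m4 sport", "facebook bejelentkezés", "youtube", "rtl klub"}


def is_exact_navigational(trend: str) -> bool:
    """Return True if the normalized trend appears in the curated set.

    Normalization here (casefold + whitespace collapse) is an assumption
    about what "case-insensitive matching" means in practice.
    """
    normalized = " ".join(trend.casefold().split())
    return normalized in DOMAIN_SEARCHES
```

Set membership is a constant-time hash lookup on average, which is consistent with the sub-millisecond cost quoted above.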
2. Pattern Matching (Fast)¶
- Regex patterns for common navigational query types:
  - Domain URLs: example.hu, www.example.com
  - Login searches: facebook bejelentkezés, gmail login
  - Live streams: m4 élő, rtl online
- Performance: ~0.01ms per check
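The three query types listed above could be expressed as compiled regular expressions along these lines. The patterns are illustrative guesses; the actual contents of DOMAIN_PATTERNS in src/filters.py may differ, and `matches_navigational_pattern` is a hypothetical helper name.

```python
import re

# Illustrative patterns only; not copied from the project.
DOMAIN_PATTERNS = [
    # Domain URLs such as example.hu or www.example.com
    re.compile(r"^(www\.)?[a-z0-9-]+\.(hu|com|net|org)$"),
    # Login searches such as "facebook bejelentkezés" or "gmail login"
    re.compile(r"\b(bejelentkezés|belépés|login)$"),
    # Live-stream searches such as "m4 élő" or "rtl online"
    re.compile(r"\b(élő|online|live)$"),
]


def matches_navigational_pattern(trend: str) -> bool:
    """Return True if any pattern matches the casefolded, stripped trend."""
    text = trend.casefold().strip()
    return any(pattern.search(text) for pattern in DOMAIN_PATTERNS)
```

Patterns this broad can also catch legitimate trends that happen to end in "élő" or "online", so in practice they would likely need tightening against real data.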
3. Semantic Similarity (Slower, More Robust)¶
- Uses sentence embeddings to catch variations
- Compares trend against known navigational searches
- Catches queries like:
  - "m4 sport online" → similar to "m4 sport"
  - "m4 tv élő adás" → similar to "m4 élő"
  - "facebook belépés" → similar to "facebook bejelentkezés"
- Performance: ~50-100ms per check (first run downloads ~120MB model)
- Threshold: 0.75 cosine similarity
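The threshold check itself is a cosine-similarity comparison. The sketch below uses plain Python lists in place of real sentence embeddings, so it runs without downloading the model. The helper names are hypothetical, and treating a score equal to the threshold as a match (`>=`) is an assumption.

```python
import math

SIMILARITY_THRESHOLD = 0.75  # value documented above


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction; treat it as dissimilar
    return dot / (norm_a * norm_b)


def is_semantically_navigational(trend_vec, known_vecs,
                                 threshold=SIMILARITY_THRESHOLD):
    """True if the trend is close enough to any known navigational search."""
    best = max((cosine_similarity(trend_vec, k) for k in known_vecs),
               default=0.0)
    return best >= threshold
```

In the real pipeline the vectors would come from the embedding model described under Model Details, with the known navigational searches embedded once and reused.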
Filtered Searches¶
TV Channels¶
- m4, m4sport, m4 sport, m4 élő, m4 sport élő
- rtl, rtl klub, rtl most, rtl+, rtl plus
- tv2, tv2 play, tv2 élő
- duna tv, duna world
Social Media¶
- facebook, fb, facebook bejelentkezés, facebook login
- youtube, yt, youtube.com
- instagram, insta, ig
- twitter, x.com
- tiktok, tik tok
Search Engines & Email¶
- google, google.com, google translate, google fordító
- gmail, gmail bejelentkezés
- freemail, freemail bejelentkezés
- citromail, citromail bejelentkezés
News Sites¶
- index.hu, index
- origo, origo.hu
- 444, 444.hu
- telex, telex.hu
- hvg, hvg.hu
- portfolio, portfolio.hu
Other¶
- időjárás, időkép, idokep
- translate, fordító
- map, térkép, google maps
Configuration¶
Adjusting Similarity Threshold¶
Edit src/filters.py:
# Lower = more aggressive filtering (may produce false positives)
# Higher = less aggressive filtering (may miss variations)
SIMILARITY_THRESHOLD = 0.75 # Default: 0.75
Adding New Filters¶
Add to the DOMAIN_SEARCHES set in src/filters.py:
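For example, assuming DOMAIN_SEARCHES is a plain Python set of lowercase strings, new entries might be added like this. The existing entries shown are a stand-in excerpt and the "netflix" entries are hypothetical additions.

```python
# Stand-in excerpt; the full set is defined in src/filters.py.
DOMAIN_SEARCHES = {
    "m4 sport",
    "facebook",
    "youtube",
    # Hypothetical new entries, kept lowercase so that a
    # case-insensitive lookup on the normalized trend finds them.
    "netflix",
    "netflix bejelentkezés",
}
```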
Enabling/Disabling Semantic Matching¶
Semantic matching is disabled by default for performance. To enable it:
Option 1: Environment Variable (Recommended)
Option 2: Code Change
Option 3: Per-Call Override
# In src/scraping/google_trends.py
should_filter, filter_reason = should_filter_trend(title_text, use_semantic=True)
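One plausible precedence between these options is that an explicit per-call argument wins, an environment variable comes next, and the default is off. The sketch below illustrates that ordering only. The variable name `TREND_FILTER_USE_SEMANTIC` and both helper functions are hypothetical and are not taken from the project.

```python
import os
from typing import Optional

# Hypothetical variable name, used purely for illustration.
ENV_VAR = "TREND_FILTER_USE_SEMANTIC"


def _env_flag(name: str, default: bool = False) -> bool:
    """Interpret common truthy strings in an environment variable."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def resolve_use_semantic(use_semantic: Optional[bool] = None) -> bool:
    """Per-call argument first, then the environment, else disabled."""
    if use_semantic is not None:
        return use_semantic
    return _env_flag(ENV_VAR, default=False)
```

Keeping the default at False matches the documented behavior that semantic matching is off unless explicitly enabled.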
Testing¶
Run Unit Tests¶
# Run all filter tests
uv run pytest tests/test_filters.py -v
# Run only fast tests (no semantic matching)
uv run pytest tests/test_filters.py::TestDomainSearchFiltering -v
# Run semantic tests (slow, downloads model)
uv run pytest tests/test_filters.py::TestSemanticFiltering -v -m slow
Manual Testing¶
Example output:
Testing trend filtering...
======================================================================
❌ FILTERED m4 (domain search)
❌ FILTERED m4 sport (domain search)
❌ FILTERED facebook (domain search)
✅ ALLOWED Barcelona Real Madrid
✅ ALLOWED Fradi meccs
✅ ALLOWED Orbán Viktor
======================================================================
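A small loop like the following could produce output in that shape. It is a sketch with a stand-in filter: `should_filter_trend_stub` only does exact matching against three sample entries and is not the project's `should_filter_trend`, though it returns the same `(should_filter, reason)` tuple shape.

```python
# Stand-in data and filter, for illustration only.
KNOWN_NAVIGATIONAL = {"m4", "m4 sport", "facebook"}


def should_filter_trend_stub(trend: str):
    """Return (should_filter, reason) using exact matching only."""
    if trend.casefold().strip() in KNOWN_NAVIGATIONAL:
        return True, "domain search"
    return False, None


def format_result(trend: str) -> str:
    """Render one result line in the style of the example output."""
    filtered, reason = should_filter_trend_stub(trend)
    if filtered:
        return f"❌ FILTERED {trend} ({reason})"
    return f"✅ ALLOWED {trend}"


if __name__ == "__main__":
    print("Testing trend filtering...")
    print("=" * 70)
    for sample in ["m4", "m4 sport", "facebook",
                   "Barcelona Real Madrid", "Fradi meccs", "Orbán Viktor"]:
        print(format_result(sample))
    print("=" * 70)
```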
Performance Impact¶
Without Filtering¶
- Processes all trends, including navigational searches
- Wastes ~$0.05-0.10 in LLM calls on each navigational trend that should have been filtered
- Generates low-quality articles about "how to access m4 sport"
With Filtering (Exact + Pattern Only)¶
- Exact + pattern matching: negligible overhead (~0.01ms per trend)
- Recommended for CI/CD: Fast, no model downloads
- Catches 90%+ of navigational searches
- Saves ~$0.10-0.30 per run by avoiding bad trends
With Semantic Filtering (Optional)¶
- First run: ~2-3 minutes to download 120MB model
- Subsequent runs: ~50-100ms per trend
- Catches 95%+ of navigational searches (including variations)
- Recommended for local development: More robust, but slower
- ROI: Pays for itself after ~10 runs
CI/CD Optimization¶
By default, semantic filtering is disabled in CI/CD to avoid:
- Downloading the 120MB model on every run (~30-60 seconds)
- An additional 50-100ms per trend for embedding computation
- Potential memory issues on GitHub Actions runners
The exact + pattern matching approach is sufficient for most cases and adds negligible overhead.
Model Details¶
Embedding Model¶
- Name: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
- Size: ~120MB
- Languages: 50+ languages including Hungarian
- Speed: ~100 sentences/second on CPU
- Quality: Good balance between speed and accuracy
Why This Model?¶
- Multilingual support (handles Hungarian + English queries)
- Lightweight (fits in memory, fast inference)
- Good semantic understanding of navigational intent
- Pre-trained on paraphrase detection (perfect for our use case)
Future Improvements¶
- Add more patterns: Detect "how to" queries, "download" queries
- User feedback: Learn from false positives/negatives
- Category-specific filters: Different rules for different categories
- Time-based filters: Filter out stale trends (>7 days old)
- Popularity threshold: Filter trends with <1K searches
Troubleshooting¶
False Positives (Good trends filtered)¶
- Raise SIMILARITY_THRESHOLD (e.g., 0.70 → 0.80); a higher threshold filters less aggressively
- Remove overly broad keywords from DOMAIN_SEARCHES
- Check semantic matches in logs
False Negatives (Bad trends not filtered)¶
- Add specific keywords to DOMAIN_SEARCHES
- Add new patterns to DOMAIN_PATTERNS
- Lower SIMILARITY_THRESHOLD (e.g., 0.75 → 0.70)
Slow Performance¶
- Disable semantic matching: use_semantic=False
- Use GPU if available: model_kwargs={"device": "cuda"}
- Cache embeddings for common queries
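Caching embeddings for common queries can be as simple as memoizing the embedding call. The sketch below uses `functools.lru_cache` around a stand-in function; `_embed_uncached` returns a toy vector and only represents where a real model call would go. The cache size is an arbitrary choice.

```python
from functools import lru_cache

# Counter used only to show how often the "model" is actually invoked.
CALLS = {"count": 0}


def _embed_uncached(text: str) -> tuple:
    """Stand-in for an expensive embedding call; returns a toy vector."""
    CALLS["count"] += 1
    return (float(len(text)), float(text.count(" ")))


@lru_cache(maxsize=4096)
def embed(text: str) -> tuple:
    """Memoized embedding lookup keyed on the exact query string.

    Returning a tuple keeps cached values immutable, so callers
    cannot accidentally modify a shared cached vector.
    """
    return _embed_uncached(text)
```

Because trending queries repeat heavily between runs within one process, repeated lookups are served from memory instead of recomputing the embedding. Normalizing the string before the lookup would raise the hit rate further.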