Trend Filtering System

Overview

The MiPörög trend aggregator includes a filtering system that excludes navigational searches, which are unsuitable for news article generation. This keeps the system from spending resources on searches such as "m4 sport", "facebook bejelentkezés" (Hungarian for "facebook login"), or "youtube", where users are simply trying to reach a website.

How It Works

The filtering system uses a hybrid approach with three layers:

1. Exact String Matching (Fast)

Checks against a curated list of 50+ common navigational searches
Includes TV channels, social media sites, search engines, email providers
Case-insensitive matching
Performance: ~0.001ms per check
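
A minimal sketch of this layer, assuming the curated list is held in a set of lowercase strings. The set contents below are a small hypothetical subset and `is_exact_navigational` is an illustrative name, not necessarily what src/filters.py defines:

```python
# Hypothetical subset of the curated list; the real set lives in src/filters.py.
DOMAIN_SEARCHES = {"m4 sport", "facebook", "youtube", "gmail", "index.hu"}


def is_exact_navigational(trend: str) -> bool:
    """Return True if the trend equals a known navigational search."""
    # Normalize whitespace and case so the comparison is case-insensitive.
    normalized = trend.strip().casefold()
    # Set membership is a hash lookup, which is why this layer is so cheap.
    return normalized in DOMAIN_SEARCHES


print(is_exact_navigational("YouTube"))
print(is_exact_navigational("Barcelona Real Madrid"))
```

Keeping every entry lowercase means only the incoming trend has to be normalized at lookup time.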

2. Pattern Matching (Fast)

Regex patterns for common navigational query types:
Domain URLs: example.hu, www.example.com
Login searches: facebook bejelentkezés, gmail login
Live streams: m4 élő, rtl online
Performance: ~0.01ms per check
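
The three query types above could be expressed roughly as follows. These regular expressions are illustrative assumptions written for this example, not the patterns actually used in src/filters.py, and `match_navigational_pattern` is a hypothetical helper name:

```python
import re

# Illustrative patterns only; the project's real patterns may differ.
NAVIGATIONAL_PATTERNS = [
    # Bare domains such as example.hu or www.example.com
    ("domain", re.compile(r"^(?:www\.)?[a-z0-9-]+(?:\.[a-z0-9-]+)*\.[a-z]{2,}$")),
    # Login searches such as "facebook bejelentkezés" or "gmail login"
    ("login", re.compile(r"^\S+ (?:bejelentkezés|login)$")),
    # Live-stream searches such as "m4 élő" or "rtl online"
    ("live", re.compile(r"^\S+ (?:élő|online)$")),
]


def match_navigational_pattern(trend: str):
    """Return the name of the first matching pattern, or None."""
    normalized = trend.strip().lower()
    for name, pattern in NAVIGATIONAL_PATTERNS:
        if pattern.search(normalized):
            return name
    return None


print(match_navigational_pattern("www.example.com"))
print(match_navigational_pattern("Fradi meccs"))
```

Patterns this broad can produce false positives (any "<word> online" query matches the live-stream rule here), so real patterns would likely need to be narrower.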

3. Semantic Similarity (Slower, More Robust)

Uses sentence embeddings to catch variations
Compares trend against known navigational searches
Catches queries like:
"m4 sport online" → similar to "m4 sport"
"m4 tv élő adás" → similar to "m4 élő"
"facebook belépés" → similar to "facebook bejelentkezés"
Performance: ~50-100ms per check (first run downloads ~120MB model)
Threshold: 0.75 cosine similarity
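
The threshold comparison can be sketched as below. To keep the example self-contained, the vectors are toy values standing in for real sentence embeddings, which in the actual system would come from the embedding model. The function names are hypothetical:

```python
import math

SIMILARITY_THRESHOLD = 0.75


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Treat a zero vector as dissimilar to everything.
    return dot / (norm_a * norm_b)


def is_semantically_navigational(trend_vec, known_vecs,
                                 threshold=SIMILARITY_THRESHOLD):
    """True if the trend is close enough to any known navigational search."""
    best = max((cosine_similarity(trend_vec, v) for v in known_vecs),
               default=0.0)
    return best >= threshold


# Toy 3-dimensional "embeddings" (assumed values, not model output).
known = [[1.0, 0.0, 0.0]]
print(is_semantically_navigational([0.9, 0.1, 0.0], known))
print(is_semantically_navigational([0.0, 1.0, 0.0], known))
```

Taking the maximum similarity over all known searches means a trend is filtered as soon as it resembles any single entry.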

Filtered Searches

TV Channels

m4, m4sport, m4 sport, m4 élő, m4 sport élő
rtl, rtl klub, rtl most, rtl+, rtl plus
tv2, tv2 play, tv2 élő
duna tv, duna world

Social Media

facebook, fb, facebook bejelentkezés, facebook login
youtube, yt, youtube.com
instagram, insta, ig
twitter, x.com
tiktok, tik tok
linkedin

Search Engines & Email

google, google.com, google translate, google fordító
gmail, gmail bejelentkezés
freemail, freemail bejelentkezés
citromail, citromail bejelentkezés

News Sites

index.hu, index
origo, origo.hu
444, 444.hu
telex, telex.hu
hvg, hvg.hu
portfolio, portfolio.hu

Other

időjárás, időkép, idokep
translate, fordító
map, térkép, google maps

Configuration

Adjusting Similarity Threshold

Edit src/filters.py:

# Lower = more aggressive filtering (may catch false positives)
# Higher = less aggressive filtering (may miss variations)
SIMILARITY_THRESHOLD = 0.75  # Default: 0.75

Adding New Filters

Add to the DOMAIN_SEARCHES set in src/filters.py:

DOMAIN_SEARCHES = {
    # ... existing entries ...
    "new_site", "new_site.hu",
}

Enabling/Disabling Semantic Matching

Semantic matching is disabled by default for performance. To enable it:

Option 1: Environment Variable (Recommended)

# In .env or GitHub Actions secrets
USE_SEMANTIC_FILTERING=true

Option 2: Code Change

# In src/config.py
USE_SEMANTIC_FILTERING = True  # Change from os.getenv(...)

Option 3: Per-Call Override

# In src/scraping/google_trends.py
should_filter, filter_reason = should_filter_trend(title_text, use_semantic=True)
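
One way the three options could interact is sketched below: an explicit per-call value wins, otherwise the environment variable decides, and the default is off. The helper names and the accepted truthy spellings ("1", "true", "yes") are assumptions for illustration. The source only shows `USE_SEMANTIC_FILTERING=true`:

```python
import os


def semantic_filtering_enabled(env=None):
    """Read USE_SEMANTIC_FILTERING from the environment (default: disabled)."""
    env = os.environ if env is None else env
    raw = env.get("USE_SEMANTIC_FILTERING", "false")
    # Assumed set of truthy spellings; only "true" appears in the docs.
    return raw.strip().lower() in {"1", "true", "yes"}


def resolve_use_semantic(use_semantic=None, env=None):
    """A per-call override wins; otherwise fall back to the environment."""
    if use_semantic is not None:
        return use_semantic
    return semantic_filtering_enabled(env)


print(resolve_use_semantic(None, {"USE_SEMANTIC_FILTERING": "true"}))
print(resolve_use_semantic(False, {"USE_SEMANTIC_FILTERING": "true"}))
```

Passing the environment in as a mapping keeps the helper easy to test without mutating `os.environ`.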

Testing

Run Unit Tests

# Run all filter tests
uv run pytest tests/test_filters.py -v

# Run only fast tests (no semantic matching)
uv run pytest tests/test_filters.py::TestDomainSearchFiltering -v

# Run semantic tests (slow, downloads model)
uv run pytest tests/test_filters.py::TestSemanticFiltering -v -m slow

Manual Testing

# Test specific trends
python scripts/test_filter.py

Example output:

Testing trend filtering...
======================================================================
❌ FILTERED     m4                             (domain search)
❌ FILTERED     m4 sport                       (domain search)
❌ FILTERED     facebook                       (domain search)
✅ ALLOWED      Barcelona Real Madrid
✅ ALLOWED      Fradi meccs
✅ ALLOWED      Orbán Viktor
======================================================================

Performance Impact

Without Filtering

Processes all trends, including navigational searches
Wastes ~$0.05-0.10 in LLM calls on each navigational trend
Generates low-quality articles about "how to access m4 sport"

With Filtering (Exact + Pattern Only)

Exact + pattern matching: negligible overhead (~0.01ms per trend)
Recommended for CI/CD: Fast, no model downloads
Catches 90%+ of navigational searches
Saves ~$0.10-0.30 per run by avoiding bad trends

With Semantic Filtering (Optional)

First run: ~2-3 minutes to download 120MB model
Subsequent runs: ~50-100ms per trend
Catches 95%+ of navigational searches (including variations)
Recommended for local development: More robust, but slower
ROI: Pays for itself after ~10 runs

CI/CD Optimization

By default, semantic filtering is disabled in CI/CD to avoid:
Downloading the 120MB model on every run (~30-60 seconds)
An additional 50-100ms per trend for embedding computation
Potential memory issues on GitHub Actions runners

The exact + pattern matching approach is sufficient for most cases and adds negligible overhead.
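
To make the trade-off concrete, the added time per CI run can be estimated from the figures above. The trend count of 20 is a hypothetical input chosen for illustration. The per-check and download figures are the upper ends of the ranges quoted in this section:

```python
def semantic_overhead_seconds(n_trends, per_check_ms, download_s):
    """Rough added wall-clock time per run when semantic filtering is on."""
    embedding_s = n_trends * per_check_ms / 1000
    return embedding_s + download_s


# 20 trends (assumed), 100 ms per check, 60 s model download on a cold runner.
print(semantic_overhead_seconds(20, 100, 60))  # 62.0
```

Under these assumptions the model download dominates, so caching the model between runs would matter more than the per-trend embedding cost.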

Model Details

Embedding Model

Name: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
Size: ~120MB
Languages: 50+ languages including Hungarian
Speed: ~100 sentences/second on CPU
Quality: Good balance between speed and accuracy

Why This Model?

Multilingual support (handles Hungarian + English queries)
Lightweight (fits in memory, fast inference)
Good semantic understanding of navigational intent
Pre-trained on paraphrase detection (well suited to spotting reworded navigational queries)

Future Improvements

Add more patterns: Detect "how to" queries, "download" queries
User feedback: Learn from false positives/negatives
Category-specific filters: Different rules for different categories
Time-based filters: Filter out stale trends (>7 days old)
Popularity threshold: Filter trends with <1K searches
