Vertical Search Engine for Promotional Products

Custom search engine with integrated crawler, full-text indexation, multi-criteria scoring algorithm and internal ad platform

~2009 - 2013
~4 years
Software Engineer
PHP 5.2/5.3, Zend Framework 1.x, MySQL 5.1, PDO, jQuery, Apache, SVN, Eclipse PDT, Gettext, cURL, HTML4/XHTML, CSS 2.1, phpMyAdmin, Ubuntu 10/11

Code Metrics

~15K

PHP + HTML/CSS/JS

Indexation Metrics

22,320

Words indexed

Markets

2

France + Italy

Indexed pages

2,335

Supplier websites

Presentation

General context and project scope

Recherche Publicitaire was a vertical search engine specialized in the promotional and advertising products sector. Developed at European Sourcing, this project provided B2B professionals with a dedicated search tool to find promotional products (pens, mugs, textiles, gadgets, etc.) through an index built by crawling supplier websites.

The project existed in two geographic and linguistic variations:

  • recherche-publicitaire.com - French version, targeting the francophone promotional product market, powered by the Tendance Objet brand
  • oggetti-promozionali.com - Italian version, targeting the Italian promotional product market

The search engine offered full-text search across crawled supplier catalogs, with results ranked by a multi-criteria scoring algorithm combining word position coefficients, occurrence counts, and supplier partner status. An integrated ad platform (text ads and banners) served by manager.europeansourcing.com monetized the traffic.
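The ranking idea described above can be sketched as follows. This is a minimal illustration in Python rather than the project's PHP, and it is not the original Seeker code: the coefficient values, the `PARTNER_BOOST` multiplier, and the names `WordHit`, `score_page`, and `rank` are all assumptions made for the example. Only the three ingredients (position coefficients, occurrence counts, partner status) come from the description.

```python
from dataclasses import dataclass

# Hypothetical coefficients: the zones are named in the project, the values are not.
POSITION_COEFFICIENTS = {
    "title": 10.0,
    "h1": 6.0,
    "keywords": 4.0,
    "description": 3.0,
    "text": 1.0,
}
PARTNER_BOOST = 1.5  # assumed multiplier applied to partner suppliers


@dataclass
class WordHit:
    zones: set  # zones of the page in which the query word appears
    occurrences: int  # number of times the word appears on the page


def score_page(hits, is_partner):
    """Sum, per query word, the zone coefficients weighted by occurrence count."""
    total = 0.0
    for hit in hits:
        zone_weight = sum(POSITION_COEFFICIENTS.get(z, 0.0) for z in hit.zones)
        total += zone_weight * hit.occurrences
    return total * PARTNER_BOOST if is_partner else total


def rank(pages):
    """Return page URLs ordered by descending score."""
    scored = [(score_page(p["hits"], p["is_partner"]), p["url"]) for p in pages]
    return [url for _, url in sorted(scored, key=lambda s: s[0], reverse=True)]
```

With these assumed weights, a single title match still outranks three body-text matches from a partner, which illustrates how a boost can favour paying suppliers without overriding relevance.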

Global Architecture

Objectives, Context, Stakes & Risks

Strategic challenges and project constraints

Objectives
  • Build a specialized search engine for promotional products - a better alternative to generic Google searches for this niche B2B sector
  • Automatically index supplier catalogs from the European Sourcing network via web crawling
  • Monetize traffic through an integrated internal ad platform (text ads + banners)
  • Cover the international market: France (recherche-publicitaire.com) and Italy (oggetti-promozionali.com)
Context
  • Part of the broader European Sourcing ecosystem (europeansourcing.com, tendanceobjet.com, manager, es_crawler)
  • Development with Eclipse PDT and SVN versioning on subversion.europeansourcing.com
  • Two architectural versions: initial Zend Framework 1.x (MVC, i18n, Zend_Paginator) then PHP native rewrite (simplified deployment, custom Seeker engine)
  • Small team of 2 developers: Jose (35%) and Vincent (65%)
Strategic positioning

Become the reference for promotional product search in Europe - a complement to trade shows and paper directories

Ranking quality

Paying suppliers needed priority placement while maintaining satisfactory relevance for users

Internationalization

Ability to deploy across multiple European markets (France, Italy) as a differentiator

Identified risks

Index quality

Dependency on crawling quality and HTML structure of source sites - poorly structured sites or iframes indexed badly

Scalability

Database schema storing the full response_body (BLOB) per page made the DB very large

Security

SQL queries built via string concatenation instead of prepared statements - SQL injection exposure

Crawler ethics

User-agent impersonating Firefox rather than declaring itself as a bot

Steps - What I Did

Development phases and contributions

Gantt Timeline
Phase 1 - Zend Framework Prototype (~2009-2012)

The first technical implementation used Zend Framework 1.x. This version constitutes the project genesis:

  • Complete web crawler (Tools_Crawler): recursive traversal of supplier websites up to depth 5, with user-agent management, timeouts, MD5 deduplication, and extension filtering
  • Sophisticated indexation: each word extracted from a page is associated with its position (host, URL, title, description, keywords, h1-h6, links, text) via boolean indicators in pages_words
  • Semantic proximity system: words_proximities table for calculating semantic proximity between words
  • Internationalization: Italian translation (gettext, .po/.mo files), with adapted logos for each market
  • Google-like interface: search bar, paginated results, related keyword suggestions, thumbnails via open.thumbshots.org
  • Real data: SQL dump contains 2,335 crawled pages and 22,320 indexed words from Italian promotional product sites
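The crawling behaviour listed above (depth limit of 5, MD5 deduplication, extension filtering) can be sketched as below. This is a Python illustration under stated assumptions, not the Tools_Crawler code: the `fetch` and `extract_links` callables, the `SKIPPED_EXTENSIONS` list, and the same-host restriction are hypothetical choices made to keep the example self-contained and free of network access.

```python
import hashlib
from urllib.parse import urljoin, urlsplit

MAX_DEPTH = 5  # depth limit stated in the project description
SKIPPED_EXTENSIONS = (".jpg", ".png", ".gif", ".pdf", ".zip")  # assumed list


def crawl(start_url, fetch, extract_links, max_depth=MAX_DEPTH):
    """Depth-limited recursive crawl restricted to the start host.

    `fetch(url)` returns the page body as a string or None, and
    `extract_links(body)` returns the hrefs found in it. Both are
    hypothetical callables supplied by the caller.
    """
    host = urlsplit(start_url).hostname
    seen_urls = set()
    seen_hashes = set()
    pages = {}

    def visit(url, depth):
        if depth > max_depth or url in seen_urls:
            return
        seen_urls.add(url)
        parts = urlsplit(url)
        if parts.hostname != host or parts.path.lower().endswith(SKIPPED_EXTENSIONS):
            return
        body = fetch(url)
        if body is None:
            return
        digest = hashlib.md5(body.encode("utf-8")).hexdigest()
        if digest in seen_hashes:  # same content reached through another URL
            return
        seen_hashes.add(digest)
        pages[url] = body
        for href in extract_links(body):
            visit(urljoin(url, href), depth + 1)

    visit(start_url, 0)
    return pages
```

Passing the fetcher in as a parameter keeps the traversal logic testable against an in-memory dictionary of pages.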
Crawler Indexation Flow
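As a rough illustration of the indexation step, the Python sketch below records, for each word of a page, the zones in which it appears and how often. It is an assumption-laden simplification: the class `ZoneIndexer`, the helper `index_page`, and the zone names are hypothetical, only a subset of the zones mentioned in the project is covered, and the result is an in-memory dictionary rather than rows in pages_words.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
TRACKED = HEADINGS | {"title", "a", "script", "style"}
WORD_RE = re.compile(r"\w+")


class ZoneIndexer(HTMLParser):
    """Collect, for each word, the zones it appears in and its occurrence count."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # tracked tags currently open, innermost last
        self.zones = defaultdict(set)  # word -> zones where it was seen
        self.counts = defaultdict(int)  # word -> total occurrences

    def _add(self, text, zone):
        for word in WORD_RE.findall(text.lower()):
            self.zones[word].add(zone)
            self.counts[word] += 1

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = (attrs.get("name") or "").lower()
        if tag == "meta" and name in ("description", "keywords"):
            self._add(attrs.get("content") or "", name)
        elif tag in TRACKED:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        zone = self.open_tags[-1] if self.open_tags else "text"
        if zone in ("script", "style"):
            return  # never index code or stylesheets
        self._add(data, "links" if zone == "a" else zone)


def index_page(html_source):
    """Return {word: (sorted zone names, occurrence count)} for one page."""
    indexer = ZoneIndexer()
    indexer.feed(html_source)
    indexer.close()
    return {w: (sorted(indexer.zones[w]), indexer.counts[w]) for w in indexer.zones}
```

Each zone name here plays the role of one boolean indicator column: a word flagged in "title" and "h1" would have both indicators set for that page.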
Phase 2 - PHP Native Rewrite (~2012)

A second version was developed in native PHP to simplify deployment and reduce dependencies. This is where my primary contribution was:

  • Seeker search engine: PHP class with 3 search modes (NORMAL, GROUPED, SPECIFIC) and a scoring algorithm based on position coefficients and occurrence counts
  • Product value system: the value table stored rich product metadata (title, reference, description, price, quantity/price JSON, colors JSON, printing, packaging, dimensions, material, weight, sizes, brand, tech sheet, images JSON, PDF)
  • Enriched display: results showed product images, price-by-quantity tables, descriptions with keyword highlighting
  • Integrated ad platform: text ads (ad_text) and banners (ad_banner) with configurable placement (top/right) and AJAX click counters
  • Statistics module: stat/ directory for counting clicks on results and ads
  • Admin tools: IP-restricted administration links for reindexing or deleting resources directly from search results
  • Utility library: custom classes (Tools_Html, Tools_String, Tools_Url, Tools_Date, LOG) with advanced HTML parsing, Unicode-aware text summarization, and URL manipulation
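The three Seeker search modes can be pictured with the short Python sketch below. The mode names come from the project; everything else is an assumption for illustration, in particular the interpretation of GROUPED as "keep the best-scoring page per site", the `search` function signature, and the use of pre-scored `(url, score)` pairs as input.

```python
from enum import Enum
from urllib.parse import urlsplit


class Mode(Enum):
    NORMAL = "normal"      # flat list of pages
    GROUPED = "grouped"    # one entry per site (assumed: its best page)
    SPECIFIC = "specific"  # pages of a single site only


def search(results, mode, site=None):
    """Order `results`, a list of (url, score) pairs already matched to the query."""
    ranked = sorted(results, key=lambda r: r[1], reverse=True)
    if mode is Mode.NORMAL:
        return ranked
    if mode is Mode.SPECIFIC:
        return [r for r in ranked if urlsplit(r[0]).hostname == site]
    # GROUPED: keep the highest-scoring page of each host, in rank order
    best = {}
    for url, score in ranked:
        best.setdefault(urlsplit(url).hostname, (url, score))
    return list(best.values())
```

For example, `search(results, Mode.SPECIFIC, site="a.test")` would restrict the ranked list to pages hosted on a.test.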
User Search Sequence
Phase 3 - Archive (~2013)

The last modifications date from April 2013. The project was archived without evidence of large-scale production deployment. Unresolved TODO markers, default passwords in the code, and PHP error display enabled in .htaccess suggest it remained in prototype or internal-use status.

Project Timeline

Actors & Interactions

Team and stakeholders

Team & Contributions

Team of 2 developers only - no dedicated designer, PM, or QA. SVN metadata analysis reveals the contribution split:

Contributor | Zend FW | PHP Native | Global
Vincent | 100% | ~47% | ~65%
Jose DA COSTA | 0% | ~53% | ~35%

Percentages are estimated from SVN entries (last committer per file). They do not measure the volume of code written by each person, only file ownership at archival time.

Team Contributions
External Stakeholders
  • European Sourcing - commissioning company, platform operator
  • Tendance Objet (tendanceobjet.com) - brand powering the French version
  • Stefania (stefania@europeansourcing.com) - operational contact for the Italian version
  • manager.europeansourcing.com - internal ad platform serving text ads and banners
  • es_crawler - companion crawling tool (separate project, referenced in code)
  • open.thumbshots.org - third-party service for website thumbnails

Results

Deliverables and measurable impact

Code Metrics
Indexation Metrics
File Type Distribution
Delivered Features (16)
Full-text search engine with 3 modes (normal, grouped by site, site-specific)
Recursive web crawler (depth 5, MD5 dedup, extension filtering)
Multi-criteria ranking algorithm (position coefficients, occurrences, PageRank)
Detailed indexation with 16 position zones analyzed per word
Semantic proximity system between words
Enriched product metadata (prices, docs, colors, materials, specs)
Integrated ad platform (text ads + banners with configurable placement)
Complete pagination with page navigation
Keyword highlighting in results and ads
Smart text summarization (relevant passage extraction)
Related search suggestions
French and Italian internationalization (gettext)
Homepage keyword suggestions
IP-restricted online admin tools
Click statistics on results and ads
Complete utility library (HTML parsing, URL manipulation, Unicode strings)
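Two of the features above, keyword highlighting and passage extraction, can be sketched in a few lines of Python. This is an illustrative approximation, not the Tools_String implementation: the function names `highlight` and `summarize`, the `<b>` wrapper, and the fixed-width window are assumptions.

```python
import html
import re


def highlight(text, keywords):
    """HTML-escape `text` and wrap whole-word keyword matches in <b> tags."""
    if not keywords:
        return html.escape(text)
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, keywords)) + r")\b", re.IGNORECASE
    )
    parts = pattern.split(text)
    # With one capturing group, odd indexes of `parts` hold the matched keywords.
    return "".join(
        f"<b>{html.escape(p)}</b>" if i % 2 else html.escape(p)
        for i, p in enumerate(parts)
    )


def summarize(text, keywords, width=40):
    """Return about `width` characters positioned around the first keyword match."""
    lowered = text.lower()
    positions = [p for p in (lowered.find(k.lower()) for k in keywords) if p >= 0]
    if not positions:
        return text[:width]  # no match: fall back to the start of the text
    start = max(0, min(positions) - width // 2)
    end = start + width
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end] + suffix
```

Splitting before escaping keeps a keyword from matching inside an HTML entity such as `&amp;`, and Python string slicing works on code points, so accented characters are not cut in half.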
Scoring Coefficients by Position
Entity-Relationship Diagram
Version 1 (PHP Native) - recherchepublicitaire database
Version Comparison: Zend vs PHP Native
Architecture Layers
Technical Stack Radar
Personal Growth
  • PHP OOP + PDO - strengthened object-oriented programming and database abstraction skills
  • Search engine internals - deep understanding of indexation, scoring algorithms, inverted indexes, stop words, and text summarization
  • Web crawling - full HTTP crawler development with depth management, deduplication, and filtering
  • HTML parsing - advanced content extraction (meta tags, headings, body text, iframes, links)
  • URL manipulation - comprehensive URL analysis and normalization (RFC 2616 compliant)
  • SVN collaboration - source code management in a small team
  • Ad platform integration - understanding of digital advertising mechanics (placement, click tracking, monetization)
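As an illustration of the URL normalization mentioned in the list above, the following Python sketch lowercases the scheme and host, drops default ports and fragments, and supplies a missing path. It is a simplified assumption of what such a routine might cover, not the Tools_Url code; the name `normalize_url` and the chosen rules are hypothetical, and user-info components are deliberately discarded.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url):
    """Return a canonical form of `url` suitable for deduplication."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it differs from the scheme's default.
    netloc = host if port in (None, DEFAULT_PORTS.get(scheme)) else f"{host}:{port}"
    path = parts.path or "/"
    return urlunsplit((scheme, netloc, path, parts.query, ""))  # fragment dropped
```

Normalizing before deduplication lets a crawler treat `HTTP://Example.COM:80/a#top` and `http://example.com/a` as the same resource.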
Business Impact
  • Acquisition channel - the search engine was designed as a complementary acquisition channel for the European Sourcing network, attracting professionals searching for promotional products
  • International coverage - 2 markets (France and Italy) with dedicated logos, translations, and domain names
  • Data volume - 2,335 supplier pages indexed, 22,320 words in dictionary - figures indicating early-stage/prototype status
  • Suppliers crawled - multiple Italian promotional product suppliers (acquapubblicitaria.com, adriabandiere.com, agende.it, abcpromotion.it, etc.)

Project Aftermath

Project outcome and what came next

The project was archived in April 2013 after approximately 4 years of development. The last SVN commit (#13) dates from June 2012, and the final file modifications from April 2013.

There is no evidence of large-scale production deployment: unresolved TODO markers throughout the code, default passwords still in source files, PHP error display enabled in .htaccess, and the absence of production configuration files suggest the project never fully reached public-facing production status.

The domains recherche-publicitaire.com and oggetti-promozionali.com are no longer active. The code was preserved on a NAS backup, which served as the source material for this retrospective analysis.

The broader European Sourcing ecosystem continued to evolve after this project. The search capabilities were later absorbed into the main europeansourcing.com platform, which adopted Elasticsearch as its search backend - precisely the kind of specialized search engine technology that would have benefited this project from the start.

Infrastructure & Deployment

Critical Reflection

Retrospective analysis and lessons learned

Strengths
  • Technical ambition - Building a search engine from scratch demonstrates solid understanding of web indexation principles, ranking algorithms, and crawling. The multi-criteria scoring algorithm with position coefficients was a relevant design choice.
  • Rich utility library - Tools_Html, Tools_String, Tools_Url formed an impressive toolkit for web content parsing and manipulation, with fine-grained Unicode handling and URL edge cases.
  • Architectural evolution - The transition from PHP native to Zend Framework shows capacity for architectural evolution and a drive to professionalize the codebase.
  • Advanced search features - 3 search modes (normal, grouped, specific), keyword highlighting, smart summaries, and related searches formed a complete feature set for a vertical search engine.
  • Built-in monetization - The text ad and banner system with configurable placement and click tracking shows a clear business vision.
Areas for Improvement
  • Security gaps - SQL queries built via string concatenation in seeker.php and tools.php exposed the application to SQL injection attacks. A comment reading // securisation avec PDO - if($this->SecureMode) $query = $query; ("securing with PDO") shows the fix was planned but never implemented.
  • Zero test coverage - No unit or functional tests were identified, making refactoring risky and regressions likely.
  • Database scalability - Storing the full response_body (BLOB) per page was problematic for scaling. With only 2,335 pages, the dump was already 512 KB compressed.
  • Code duplication - Both versions shared similar but non-identical utility classes, creating divergence risk with no mechanism for synchronization.
  • Unresolved TODOs - Multiple TODO markers for features never implemented (prepared statements, 404 handling, extension filtering, URL parser bugs).
What I Would Do Differently Today
  • Use prepared statements from day one - SQL security should have been a prerequisite, not a TODO
  • Separate crawler from search engine - two distinct applications would have allowed independent scaling. The companion es_crawler project already existed as a separate tool
  • Adopt Elasticsearch or Solr - rather than reinventing a full-text search engine in raw SQL, a dedicated engine would have offered better performance with advanced features (stemming, fuzzy matching, facets)
  • Do not store full HTML body - only the extracted text and metadata, not the complete page source code
  • Write tests - even minimal test coverage on the search engine and crawler would have made the project's evolution safer
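The first point in the list above, prepared statements, can be illustrated with a small Python and SQLite sketch. The project used PHP with PDO and MySQL, so this is an analogy rather than the original code: only the pages_words table name comes from the project, while the other tables, all column names, and the functions `build_demo_index` and `find_pages` are assumptions.

```python
import sqlite3


def build_demo_index():
    """Create a tiny in-memory index; table and column names are illustrative."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE pages (id INTEGER PRIMARY KEY, url TEXT);
        CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
        CREATE TABLE pages_words (page_id INTEGER, word_id INTEGER);
        """
    )
    conn.execute("INSERT INTO pages VALUES (1, 'http://s.test/pens')")
    conn.execute("INSERT INTO pages VALUES (2, 'http://s.test/mugs')")
    conn.execute("INSERT INTO words VALUES (1, 'pens')")
    conn.execute("INSERT INTO words VALUES (2, 'mugs')")
    conn.executemany("INSERT INTO pages_words VALUES (?, ?)", [(1, 1), (2, 2)])
    return conn


def find_pages(conn, word):
    """Look up pages containing `word` with a bound parameter, not concatenation."""
    query = (
        "SELECT p.url FROM pages p "
        "JOIN pages_words pw ON pw.page_id = p.id "
        "JOIN words w ON w.id = pw.word_id "
        "WHERE w.word = ? ORDER BY p.url"
    )
    return [row[0] for row in conn.execute(query, (word,))]
```

Because the search term travels as a bound value, an input such as `x' OR '1'='1` is compared literally against the word column and simply matches nothing.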
Lasting Lessons
  • A vertical search engine is an underestimated project - even specialized on a niche, the complexity of indexation, ranking, crawling, and user interface represents a considerable development effort
  • Content quality determines engine value - without rich and up-to-date indexed data, the search engine has no added value compared to Google
  • Internationalization must be planned from the start - the French-to-Italian transition required a significant refactoring toward Zend Framework with gettext, rather than simple configuration
  • Architectural refactoring (PHP native to MVC) is costly - maintaining two parallel versions multiplies effort without guaranteeing convergence
