Vertical Search Engine for Promotional Products


Custom search engine with an integrated crawler, full-text indexing, a multi-criteria scoring algorithm, and an in-house advertising platform

~2009 - 2013
~4 years
Software Engineer
PHP 5.2/5.3, Zend Framework 1.x, MySQL 5.1, PDO, jQuery, Apache, SVN, Eclipse PDT, Gettext, cURL, HTML4/XHTML, CSS 2.1, phpMyAdmin, Ubuntu 10/11

Code metrics: ~15K (PHP + HTML/CSS/JS)
Indexing metrics: 22,320 words indexed
Markets: 2 (France + Italy)
Pages indexed: 2,335 (supplier websites)

Overview

General context and project scope

Recherche Publicitaire was a vertical search engine specialized in the promotional and advertising products sector. Developed at European Sourcing, this project provided B2B professionals with a dedicated search tool to find promotional products (pens, mugs, textiles, gadgets, etc.) through an index built by crawling supplier websites.

The project existed in two geographic and linguistic variations:

  • recherche-publicitaire.com - French version, targeting the francophone promotional product market, powered by the Tendance Objet brand
  • oggetti-promozionali.com - Italian version, targeting the Italian promotional product market

The search engine offered full-text search across crawled supplier catalogs, with results ranked by a multi-criteria scoring algorithm combining word position coefficients, occurrence counts, and supplier partner status. An integrated ad platform (text ads and banners) served by manager.europeansourcing.com monetized the traffic.

[Diagram: Global architecture]

Objectives, Context, Challenges and Risks

Strategic challenges and project constraints

Objectives
  • Build a specialized search engine for promotional products - a better alternative to generic Google searches for this niche B2B sector
  • Automatically index supplier catalogs from the European Sourcing network via web crawling
  • Monetize traffic through an integrated internal ad platform (text ads + banners)
  • Cover the international market: France (recherche-publicitaire.com) and Italy (oggetti-promozionali.com)
Context
  • Part of the broader European Sourcing ecosystem (europeansourcing.com, tendanceobjet.com, manager, es_crawler)
  • Development with Eclipse PDT and SVN versioning on subversion.europeansourcing.com
  • Two architectural versions: initial Zend Framework 1.x (MVC, i18n, Zend_Paginator) then PHP native rewrite (simplified deployment, custom Seeker engine)
  • Small team of 2 developers: Jose (35%) and Vincent (65%)
Strategic positioning

Become the reference for promotional product search in Europe - a complement to trade shows and paper directories

Ranking quality

Paying suppliers needed priority placement while maintaining satisfactory relevance for users

Internationalization

Ability to deploy across multiple European markets (France, Italy) as a differentiator

Identified risks

Index quality

Dependency on crawling quality and on the HTML structure of source sites - poorly structured sites and iframes were indexed badly

Scalability

The database schema stored the full response_body (BLOB) for every page, which made the database very large

Security

SQL queries built via string concatenation instead of prepared statements - SQL injection exposure

Crawler ethics

User-agent impersonating Firefox rather than declaring itself as a bot

The steps - What I did

Development phases and contributions

[Diagram: Gantt timeline]
Phase 1 - Zend Framework Prototype (~2009-2012)

The first technical implementation used Zend Framework 1.x. This version was the origin of the project:

  • Complete web crawler (Tools_Crawler): recursive traversal of supplier websites up to depth 5, with user-agent management, timeouts, MD5 deduplication, and extension filtering
  • Detailed indexing: each word extracted from a page is recorded together with the zones in which it appears (host, URL, title, description, keywords, h1-h6, links, text), stored as boolean indicators in pages_words
  • Semantic proximity system: a words_proximities table used to calculate the semantic proximity between words
  • Internationalization: Italian translation (gettext, .po/.mo files), with adapted logos for each market
  • Google-like interface: search bar, paginated results, related keyword suggestions, thumbnails via open.thumbshots.org
  • Real data: SQL dump contains 2,335 crawled pages and 22,320 indexed words from Italian promotional product sites
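The crawler behaviour listed above (depth limit, MD5 deduplication, extension filtering) can be sketched as follows. This is a minimal Python illustration, not the original PHP Tools_Crawler: it uses a breadth-first queue rather than recursion, and the `fetch` and `extract_links` callables, the skipped-extension list, and the same-host restriction are assumptions made for the example.

```python
import hashlib
from collections import deque
from urllib.parse import urljoin, urlparse

MAX_DEPTH = 5  # depth limit stated in the text
SKIPPED_EXTENSIONS = (".jpg", ".png", ".gif", ".pdf", ".zip")  # assumed list


def crawl(start_url, fetch, extract_links, max_depth=MAX_DEPTH):
    """Breadth-first crawl of one site; returns {url: body} for unique pages.

    `fetch(url)` returns the page body as a str (or None on failure) and
    `extract_links(body)` returns the hrefs found in it. Both are
    hypothetical callables supplied by the caller.
    """
    host = urlparse(start_url).netloc
    seen_urls = {start_url}
    seen_hashes = set()
    pages = {}
    queue = deque([(start_url, 0)])

    while queue:
        url, depth = queue.popleft()
        body = fetch(url)
        if body is None:
            continue
        digest = hashlib.md5(body.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # identical content already stored under another URL
        seen_hashes.add(digest)
        pages[url] = body
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for href in extract_links(body):
            target = urljoin(url, href)
            parsed = urlparse(target)
            if parsed.netloc != host:
                continue  # assumption: stay on the supplier's own site
            if parsed.path.lower().endswith(SKIPPED_EXTENSIONS):
                continue  # extension filtering
            if target not in seen_urls:
                seen_urls.add(target)
                queue.append((target, depth + 1))
    return pages
```

Passing `fetch` and `extract_links` as parameters keeps the traversal logic testable without network access; a real crawler would plug in an HTTP client with timeouts and a proper HTML parser.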
[Diagram: Crawler indexing flow]
Phase 2 - PHP Native Rewrite (~2012)

A second version was developed in native PHP to simplify deployment and reduce dependencies. This is where I made my primary contribution:

  • Seeker search engine: PHP class with 3 search modes (NORMAL, GROUPED, SPECIFIC) and a scoring algorithm based on position coefficients and occurrence counts
  • Product value system: the value table stored rich product metadata (title, reference, description, price, quantity/price JSON, colors JSON, printing, packaging, dimensions, material, weight, sizes, brand, tech sheet, images JSON, PDF)
  • Enriched display: results showed product images, price-by-quantity tables, descriptions with keyword highlighting
  • Integrated ad platform: text ads (ad_text) and banners (ad_banner) with configurable placement (top/right) and AJAX click counters
  • Statistics module: stat/ directory for counting clicks on results and ads
  • Admin tools: IP-restricted administration links for reindexing or deleting resources directly from search results
  • Utility library: custom classes (Tools_Html, Tools_String, Tools_Url, Tools_Date, LOG) with advanced HTML parsing, Unicode-aware text summarization, and URL manipulation
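To make the scoring idea concrete, here is a small Python sketch of a score built from position coefficients and occurrence counts, with a boost for partner suppliers. It is not the Seeker class itself: the coefficient values, the `PARTNER_BONUS` multiplier, and the function names are all assumptions, since the text does not give the real figures.

```python
# Hypothetical coefficients: the real values are not given in the text.
ZONE_COEFFICIENTS = {
    "title": 10.0,
    "h1": 8.0,
    "url": 6.0,
    "keywords": 4.0,
    "description": 3.0,
    "text": 1.0,
}
PARTNER_BONUS = 1.5  # assumed multiplier for partner suppliers


def score_page(zone_hits, is_partner=False):
    """Score one page for one query word.

    `zone_hits` maps a zone name to the number of occurrences of the
    word in that zone; unknown zones contribute nothing.
    """
    base = sum(
        ZONE_COEFFICIENTS.get(zone, 0.0) * count
        for zone, count in zone_hits.items()
    )
    return base * PARTNER_BONUS if is_partner else base


def rank(pages):
    """Sort (url, zone_hits, is_partner) tuples by descending score."""
    scored = [(score_page(hits, partner), url) for url, hits, partner in pages]
    return [url for score, url in sorted(scored, key=lambda item: -item[0])]
```

With these assumed values, a single match in the title outweighs several matches in body text, and the partner multiplier lifts a paying supplier above an otherwise equal page without letting it overtake a much more relevant one.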
[Diagram: User search sequence]
Phase 3 - Archive (~2013)

The last modifications date from April 2013. The project was archived without evidence of large-scale production deployment. Unresolved TODO markers, default passwords in the code, and PHP error display enabled in .htaccess suggest it remained a prototype for internal use.

[Diagram: Project timeline]

Actors and interactions

Team and stakeholders

Team & Contributions

Team of 2 developers only - no dedicated designer, PM, or QA. SVN metadata analysis reveals the contribution split:

Contributor      Zend FW   PHP Native   Global
Vincent          100%      ~47%         ~65%
Jose DA COSTA    0%        ~53%         ~35%

Percentages are estimated from SVN entries (last committer per file). They reflect file ownership at archival time, not the volume of code each person wrote.

[Diagram: Team contributions]
External Stakeholders
  • European Sourcing - commissioning company, platform operator
  • Tendance Objet (tendanceobjet.com) - brand powering the French version
  • Stefania (stefania@europeansourcing.com) - operational contact for the Italian version
  • manager.europeansourcing.com - internal ad platform serving text ads and banners
  • es_crawler - companion crawling tool (separate project, referenced in code)
  • open.thumbshots.org - third-party service for website thumbnails

Results

Deliverables and measurable impact

[Charts: Code metrics, Indexing metrics, Distribution by file type]
Delivered Features (16)
  • Full-text search engine with 3 modes (normal, grouped by site, site-specific)
  • Recursive web crawler (depth 5, MD5 dedup, extension filtering)
  • Multi-criteria ranking algorithm (position coefficients, occurrences, PageRank)
  • Detailed indexation with 16 position zones analyzed per word
  • Semantic proximity system between words
  • Enriched product metadata (prices, docs, colors, materials, specs)
  • Integrated ad platform (text ads + banners with configurable placement)
  • Complete pagination with page navigation
  • Keyword highlighting in results and ads
  • Smart text summarization (relevant passage extraction)
  • Related search suggestions
  • French and Italian internationalization (gettext)
  • Homepage keyword suggestions
  • IP-restricted online admin tools
  • Click statistics on results and ads
  • Complete utility library (HTML parsing, URL manipulation, Unicode strings)
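Two of the features above, keyword highlighting and relevant passage extraction, can be illustrated with a short Python sketch. The function names, the `<strong>` tag, and the fixed-width window are assumptions for the example; the original Tools_String and Tools_Html classes may have worked differently.

```python
import html
import re


def highlight(text, keywords, tag="strong"):
    """Wrap whole-word, case-insensitive keyword matches in an HTML tag.

    The text is HTML-escaped piece by piece so that matches are found in
    the raw text, never inside an entity such as "&amp;".
    """
    words = [re.escape(k) for k in keywords if k]
    if not words:
        return html.escape(text)
    pattern = re.compile(r"\b(?:" + "|".join(words) + r")\b", re.IGNORECASE)
    parts, last = [], 0
    for match in pattern.finditer(text):
        parts.append(html.escape(text[last:match.start()]))
        parts.append(f"<{tag}>{html.escape(match.group())}</{tag}>")
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)


def summarize(text, keywords, width=60):
    """Return a passage of about `width` characters around the first match."""
    words = [re.escape(k) for k in keywords if k]
    match = re.search("|".join(words), text, re.IGNORECASE) if words else None
    if match is None:
        return text[:width].strip()  # no match: fall back to the opening text
    start = max(0, match.start() - width // 2)
    end = min(len(text), start + width)
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end].strip() + suffix
```

A result snippet would typically be produced by calling `summarize` first and then `highlight` on the extracted passage, so that only the short excerpt is escaped and decorated.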
[Chart: Scoring coefficients by position]
[Diagram: Entity-relationship diagram, Version 1 (PHP Native) - recherchepublicitaire database]
[Diagram: Version comparison, Zend vs native PHP]
[Diagram: Architecture layers]
[Chart: Technical skills radar]
Personal Growth
  • PHP OOP + PDO - strengthened object-oriented programming and database abstraction skills
  • Search engine internals - deep understanding of indexation, scoring algorithms, inverted indexes, stop words, and text summarization
  • Web crawling - full HTTP crawler development with depth management, deduplication, and filtering
  • HTML parsing - advanced content extraction (meta tags, headings, body text, iframes, links)
  • URL manipulation - comprehensive URL analysis and normalization (RFC 2616 compliant)
  • SVN collaboration - source code management in a small team
  • Ad platform integration - understanding of digital advertising mechanics (placement, click tracking, monetization)
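As an illustration of the URL normalization mentioned in the list above, the following Python sketch covers a small subset of what such a routine usually does. It is an assumption-based example using the standard library, not a port of Tools_Url, and it makes no claim about the rules the original class applied.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url):
    """Lower-case scheme and host, drop default port and fragment, ensure a path.

    Simplified on purpose: user-info is discarded and the path and query
    are kept exactly as given.
    """
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    if port in (None, DEFAULT_PORTS.get(scheme)):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    return urlunsplit((scheme, netloc, path, parts.query, ""))
```

Normalizing before deduplication means that two spellings of the same address, such as `HTTP://Example.COM:80/Catalog` and `http://example.com/Catalog`, are treated as one page by a crawler.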
Business Impact
  • Acquisition channel - the search engine was designed as a complementary acquisition channel for the European Sourcing network, attracting professionals searching for promotional products
  • International coverage - 2 markets (France and Italy) with dedicated logos, translations, and domain names
  • Data volume - 2,335 supplier pages indexed, 22,320 words in dictionary - figures indicating early-stage/prototype status
  • Suppliers crawled - multiple Italian promotional product suppliers (acquapubblicitaria.com, adriabandiere.com, agende.it, abcpromotion.it, etc.)

What happened next

Fate of the project and follow-up

The project was archived in April 2013 after approximately 4 years of development. The last SVN commit (#13) dates from June 2012, and the final file modifications from April 2013.

There is no evidence of large-scale production deployment: unresolved TODO markers throughout the code, default passwords still in source files, PHP error display enabled in .htaccess, and the absence of production configuration files suggest the project never fully reached public-facing production status.

The domains recherche-publicitaire.com and oggetti-promozionali.com are no longer active. The code was preserved on a NAS backup, which served as the source material for this retrospective analysis.

The broader European Sourcing ecosystem continued to evolve after this project. The search capabilities were later absorbed into the main europeansourcing.com platform, which adopted Elasticsearch as its search backend - precisely the kind of specialized search engine technology that would have benefited this project from the start.

[Diagram: Infrastructure and deployment]

My critical reflection

Retrospective analysis and lessons learned

Strengths
  • Technical ambition - Building a search engine from scratch demonstrates solid understanding of web indexation principles, ranking algorithms, and crawling. The multi-criteria scoring algorithm with position coefficients was a relevant design choice.
  • Rich utility library - Tools_Html, Tools_String, Tools_Url formed an impressive toolkit for web content parsing and manipulation, with fine-grained Unicode handling and URL edge cases.
  • Architectural evolution - The transition from PHP native to Zend Framework shows a capacity for architectural evolution and a drive to professionalize the codebase.
  • Advanced search features - 3 search modes (normal, grouped, specific), keyword highlighting, smart summaries, and related searches formed a complete feature set for a vertical search engine.
  • Built-in monetization - The text ad and banner system with configurable placement and click tracking shows a clear business vision.
Areas for Improvement
  • Security gaps - SQL queries built via string concatenation in seeker.php and tools.php exposed the application to SQL injection attacks. A comment reading // securisation avec PDO - if($this->SecureMode) $query = $query; shows the fix was planned but never implemented.
  • Zero test coverage - No unit or functional tests were identified, making refactoring risky and regressions likely.
  • Database scalability - Storing the full response_body (BLOB) per page was problematic for scaling. With only 2,335 pages, the dump was already 512 KB compressed.
  • Code duplication - Both versions shared similar but non-identical utility classes, creating divergence risk with no mechanism for synchronization.
  • Unresolved TODOs - Multiple TODO markers for features never implemented (prepared statements, 404 handling, extension filtering, URL parser bugs).
What I Would Do Differently Today
  • Use prepared statements from day one - SQL security should have been a prerequisite, not a TODO
  • Separate crawler from search engine - two distinct applications would have allowed independent scaling. The companion es_crawler project already existed as a separate tool
  • Adopt Elasticsearch or Solr - rather than reinventing a full-text search engine in raw SQL, a dedicated engine would have offered better performance with advanced features (stemming, fuzzy matching, facets)
  • Do not store full HTML body - only the extracted text and metadata, not the complete page source code
  • Write tests - even minimal test coverage on the search engine and crawler would have made the project's evolution safer
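The first point above, prepared statements, is easy to show in a few lines. The sketch below uses Python's built-in sqlite3 module as a stand-in for PDO and MySQL; the pages_words table name comes from the text, while the `pages` and `words` tables, their columns, and the sample row are assumptions made for the example.

```python
import sqlite3


def search_pages(conn, word):
    """Look up pages containing `word` using a bound parameter.

    The user-supplied word is passed separately from the SQL text, so it
    is never interpreted as SQL.
    """
    query = (
        "SELECT p.url FROM pages AS p "
        "JOIN pages_words AS pw ON pw.page_id = p.id "
        "JOIN words AS w ON w.id = pw.word_id "
        "WHERE w.word = ? ORDER BY p.url"
    )
    return [row[0] for row in conn.execute(query, (word,))]


def demo():
    """Build a tiny in-memory index (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE pages (id INTEGER PRIMARY KEY, url TEXT);
        CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT);
        CREATE TABLE pages_words (page_id INTEGER, word_id INTEGER);
        INSERT INTO pages VALUES (1, 'http://supplier.test/mugs');
        INSERT INTO words VALUES (1, 'mug');
        INSERT INTO pages_words VALUES (1, 1);
        """
    )
    return conn
```

Calling `search_pages(demo(), "mug' OR '1'='1")` simply finds no matching word, whereas the same input concatenated into the query string would have changed the WHERE clause. In PHP the equivalent pattern is a PDO statement prepared once and executed with bound values.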
Lasting Lessons
  • A vertical search engine is an underestimated project - even specialized on a niche, the complexity of indexation, ranking, crawling, and user interface represents a considerable development effort
  • Content quality determines engine value - without rich and up-to-date indexed data, the search engine has no added value compared to Google
  • Internationalization must be planned from the start - the French-to-Italian transition required a significant refactoring toward Zend Framework with gettext, rather than simple configuration
  • Architectural refactoring (PHP native to MVC) is costly - maintaining two parallel versions multiplies effort without guaranteeing convergence
