Using Apache Lucene

4/11/2023

The name Lucene is Doug Cutting's wife's middle name and her maternal grandmother's first name. Lucene formerly included a number of sub-projects, such as Lucene.NET, Mahout, Tika and Nutch. These are now independent top-level projects. In March 2010, the Apache Solr search server joined as a Lucene sub-project, merging the developer communities. Version 4.0 was released on October 12, 2012. In March 2021, Lucene changed its logo, and Apache Solr became a top-level Apache project again, independent from Lucene.

While suitable for any application that requires full-text indexing and searching capability, Lucene is recognized for its utility in the implementation of Internet search engines and local, single-site searching. Lucene has also been used to implement recommendation systems. For example, Lucene's 'MoreLikeThis' class can generate recommendations for similar documents. Lucene also includes a feature to perform a fuzzy search based on edit distance.
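To make the idea of fuzzy matching by edit distance concrete, here is a minimal sketch in plain Java. It computes the classic Levenshtein distance and treats two terms as a fuzzy match when they are within a given number of edits. The class and method names (`EditDistanceSketch`, `distance`, `fuzzyMatch`) are made up for this illustration; this is not Lucene's API, and it is not meant to reflect how Lucene implements fuzzy search internally.

```java
// Illustrative sketch only: plain Levenshtein edit distance,
// not Lucene's API or its internal implementation.
class EditDistanceSketch {

    // Returns the minimum number of single-character insertions,
    // deletions, or substitutions needed to turn a into b.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning the empty prefix of a into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into the empty string takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows instead of allocating a full table
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // A term matches "fuzzily" when it is within maxEdits of the query term.
    static boolean fuzzyMatch(String queryTerm, String indexedTerm, int maxEdits) {
        return distance(queryTerm, indexedTerm) <= maxEdits;
    }

    public static void main(String[] args) {
        System.out.println(distance("lucene", "lucine"));
        System.out.println(fuzzyMatch("serch", "search", 1));
    }
}
```

With a threshold of one edit, a misspelled query term such as "serch" would still match the indexed term "search".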

Doug Cutting originally wrote Lucene in 1999. Lucene was his fifth search engine, having previously written two while at Xerox PARC, one at Apple, and a fourth at Excite. It was initially available for download from its home at the SourceForge web site. It joined the Apache Software Foundation's Jakarta family of open-source Java products in September 2001 and became its own top-level Apache project in February 2005.

To search inside the index, the user has to provide a human-readable expression called a query string. The QueryParser transforms this string into an object of type Query. The Query object is analyzed and then handed to the IndexSearcher. The IndexSearcher is the core of the search process. It uses the IndexReader to access the index and retrieve all the terms that match the terms in the Query object, and returns the hits (TopDocs) as the search result. A Filter can be used to permit or prohibit one or more terms in the search results.
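The flow described above (query string, parsed query, index lookup, hits) can be sketched with a toy inverted index in plain Java. Everything here is an assumption made for illustration: the class `SearchFlowSketch`, its methods, and the rule that every plain term must match while a term prefixed with `-` is prohibited. It does not use or reproduce Lucene's QueryParser, IndexSearcher or IndexReader.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model of the search flow; names are hypothetical, not Lucene's API.
class SearchFlowSketch {

    // Inverted index: term -> ids of the documents containing it.
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Splits the text into lowercase terms and records the document id for each.
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // "Parses" the query string into terms, then returns the documents that
    // contain every plain term and none of the terms prefixed with '-'.
    Set<Integer> search(String queryString) {
        List<String> required = new ArrayList<>();
        List<String> prohibited = new ArrayList<>();
        for (String token : queryString.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (token.startsWith("-") && token.length() > 1) {
                prohibited.add(token.substring(1));
            } else if (!token.isEmpty()) {
                required.add(token);
            }
        }
        Set<Integer> hits = null;
        for (String term : required) {
            Set<Integer> docs = postings.getOrDefault(term, Collections.emptySet());
            if (hits == null) {
                hits = new TreeSet<>(docs); // first term seeds the candidate set
            } else {
                hits.retainAll(docs);       // later terms narrow it down
            }
        }
        if (hits == null) {
            return new TreeSet<>(); // no required terms: nothing to match
        }
        for (String term : prohibited) {
            hits.removeAll(postings.getOrDefault(term, Collections.emptySet()));
        }
        return hits;
    }
}
```

In this sketch the prohibited terms play the role of a very simple filter: a document that matches the query is dropped from the hits when it also contains a prohibited term.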

Indexing is done according to particular attributes. The IndexWriter uses one or more Analyzers as a strategy for index writing. The analyzer strips Lucene Documents of useless content such as spaces, hyphens, stop words and more, depending on the chosen analyzer(s). At the end of the analysis process, a Lucene Document is broken into tokens (also called terms) that are used for search.
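As a rough illustration of what analysis does, the sketch below lowercases a piece of text, splits it on anything that is not a letter or digit (which discards spaces and hyphens), and drops stop words. The class name `AnalyzerSketch`, the method `analyze`, and the short stop-word list are assumptions for this example; real Lucene analyzers are configurable and considerably more capable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy analyzer; the names and the stop-word list are assumptions,
// not Lucene's Analyzer API.
class AnalyzerSketch {

    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "is", "of", "the", "to"));

    // Lowercases the text, splits on any run of characters that are not
    // letters or digits, and filters out empty tokens and stop words.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The full-text index of Lucene"));
    }
}
```

For the input "The full-text index of Lucene", this sketch yields the terms full, text, index and lucene.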

Any application using Apache Lucene must first of all transform its original data into Lucene Documents. For this purpose a Document Handler interface is needed, which is provided by the Lucene contribution library. The Document Handler interface allows the extraction of information such as textual content, numbers and metadata from original documents and provides them as Lucene Documents. These are used for further processing during indexing and search.

Each common document type, such as HTML, PDF or XML, needs a specific document parser to extract its contents. Document parsers are not part of the Apache Lucene core. Some of them are JTidy, an HTML parser; PDFBox, a PDF document parser; and SAX, an XML parser. Once a Lucene Document is created, the IndexWriter is the next component, in charge of analyzing and storing Lucene Documents in the index.
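The role of a document handler can be illustrated with a small stand-in written in plain Java. The interface `DocumentHandlerSketch`, the class `PlainTextHandlerSketch`, and the choice of `title` and `contents` as field names are all hypothetical; a field map stands in for a Lucene Document here, and the sketch handles only plain text, where the first line is taken as the title.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-ins for illustration; these are not Lucene's classes.
interface DocumentHandlerSketch {
    // Extracts named fields (for example title and contents) from raw source data.
    Map<String, String> toDocument(String rawData);
}

// Handles plain text: the first line becomes the title, the rest the contents.
class PlainTextHandlerSketch implements DocumentHandlerSketch {
    @Override
    public Map<String, String> toDocument(String rawData) {
        String[] parts = rawData.split("\\R", 2); // split at the first line break only
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title", parts[0].trim());
        fields.put("contents", parts.length > 1 ? parts[1].trim() : "");
        return fields;
    }
}
```

A handler for HTML, PDF or XML would follow the same shape but delegate the extraction to a suitable parser before filling in the fields.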
