Elastica

A PHP client for elasticsearch.

Storing and indexing documents

To add data to an index you can simply drop in some documents and they will be indexed directly. In most cases, though, you want to specify how your data is indexed. Elasticsearch offers many ways to do so: you can decide how each field is mapped and how your data is analyzed for full-text search.

For more information on everything elasticsearch provides, take a look at the Analysis and Mapping references.

The documents in elasticsearch are organized in indices. Each index contains one or more types, which in turn contain the documents. So to put our data in elasticsearch, we first have to define what the index and the type will look like.

Define Analysis

In elasticsearch, when you create an index, you define the number of shards and the number of replicas. A shard is a part of your data and a replica is a copy of a shard that serves as a backup. When you have only one node, all primary shards live on that node; the replicas stay unassigned, because elasticsearch does not place a replica on the same node as its primary shard. When you have more nodes, your data is balanced across them. How it is balanced depends on your configuration. More on this topic can be found in the elasticsearch documentation.
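
As a quick check of what these two settings mean in practice, the following sketch counts the shard copies a cluster has to hold. The function name `totalShardCopies` is made up for this illustration; it is plain PHP arithmetic and not part of Elastica.

```php
<?php
// Hypothetical helper (not part of Elastica): each primary shard exists once,
// plus one additional copy per configured replica.
function totalShardCopies(int $numberOfShards, int $numberOfReplicas): int
{
    return $numberOfShards * (1 + $numberOfReplicas);
}

// 4 primary shards with 1 replica each: 4 primaries + 4 replica copies
echo totalShardCopies(4, 1), "\n"; // prints 8
```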

Data in elasticsearch is analyzed at two different times. The first is when you index a document: the document is analyzed and the resulting terms are stored in the index. The second is when you run a search: elasticsearch analyzes the search query and looks up the resulting terms in the index. To see all available analyzers and filters, check out the Analysis reference.
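
To make the two analysis steps more concrete, here is a deliberately simplified sketch in plain PHP. It does not use Elastica or elasticsearch. The functions `toyAnalyze`, `toyIndex` and `toySearch` are invented for this illustration, and the "analyzer" only lowercases and splits text, which is far less than a real analyzer does. The point is only that the query must be analyzed in a way compatible with the indexed data for a lookup to succeed.

```php
<?php
// Toy analyzer (an assumption for illustration): lowercase the text and split
// it into word tokens. Real elasticsearch analyzers do considerably more.
function toyAnalyze(string $text): array
{
    $tokens = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    return array_values(array_unique($tokens));
}

// Index time: analyze each document and record which documents contain each token.
function toyIndex(array $documents): array
{
    $invertedIndex = array();
    foreach ($documents as $id => $text) {
        foreach (toyAnalyze($text) as $token) {
            $invertedIndex[$token][] = $id;
        }
    }
    return $invertedIndex;
}

// Search time: analyze the query the same way, then look its tokens up.
function toySearch(array $invertedIndex, string $query): array
{
    $hits = array();
    foreach (toyAnalyze($query) as $token) {
        foreach ($invertedIndex[$token] ?? array() as $id) {
            $hits[$id] = true;
        }
    }
    return array_keys($hits);
}

$index = toyIndex(array(
    1 => 'Me wish there were expression for cookies',
    2 => 'An apple a day',
));

// Document 1 matches even though the query is written in upper case
print_r(toySearch($index, 'COOKIES'));
```

If the query were looked up without being lowercased first, "COOKIES" would not match the stored token "cookies". This is why the index-time and search-time analyzers have to fit together.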

Let’s create an index called twitter! We’ll include two analyzers. The analyzer names are important because they decide when each analyzer is used: the analyzer named “default_index” is applied to data at index time, and the analyzer named “default_search” is applied to the search query unless the query specifies another analyzer. You can also create analyzers with arbitrary names and use them by referencing them in your query. In this example we’ll also use a custom snowball filter for the data.

The second argument of \Elastica\Index::create() is an optional boolean. If it is true, an existing index with the same name is deleted first (default: false).

// Load index
$elasticaIndex = $elasticaClient->getIndex('twitter');

// Create the index, deleting an existing one with the same name first
$elasticaIndex->create(
    array(
        'number_of_shards' => 4,
        'number_of_replicas' => 1,
        'analysis' => array(
            'analyzer' => array(
                'default_index' => array(
                    'type' => 'custom',
                    'tokenizer' => 'standard',
                    'filter' => array('lowercase', 'mySnowball')
                ),
                'default_search' => array(
                    'type' => 'custom',
                    'tokenizer' => 'standard',
                    'filter' => array('standard', 'lowercase', 'mySnowball')
                )
            ),
            'filter' => array(
                'mySnowball' => array(
                    'type' => 'snowball',
                    'language' => 'German'
                )
            )
        )
    ),
    true
);

Define Mapping

The mapping defines what kind of data is stored in which field. If no mapping is defined, elasticsearch guesses the type of each field and maps it automatically. To see all of the possibilities, check out the Mapping reference.

In newer versions of elasticsearch you can no longer use the mapping to assign a role to your custom analyzers. Instead, you have to provide two analyzers with the default names.

In our example, we will create a type called tweet in our index twitter. First we create the type and afterwards we define the mapping. Note that it is possible to boost data in elasticsearch: you can boost a specific field like ‘title’ to give it more importance than normal content. A boost is defined in the mapping, right next to the type of the field. In this example we boost the importance of the user's ‘fullName’ by a factor of 2.

// Load type
$elasticaType = $elasticaIndex->getType('tweet');

// Define mapping
$mapping = new \Elastica\Type\Mapping();
$mapping->setType($elasticaType);

// Set mapping
$mapping->setProperties(array(
    'id'      => array('type' => 'integer', 'include_in_all' => FALSE),
    'user'    => array(
        'type' => 'object',
        'properties' => array(
            'name'      => array('type' => 'string', 'include_in_all' => TRUE),
            'fullName'  => array('type' => 'string', 'include_in_all' => TRUE, 'boost' => 2)
        ),
    ),
    'msg'     => array('type' => 'string', 'include_in_all' => TRUE),
    'tstamp'  => array('type' => 'date', 'include_in_all' => FALSE),
    'location'=> array('type' => 'geo_point', 'include_in_all' => FALSE)
));

// Send mapping to type
$mapping->send();

Add documents

Now that we have our index ready for the data, we just need to go ahead and put some data in there!

First we put together our document. In our example it’s a tweet. This tweet is going to be an \Elastica\Document, which is then added to our type tweet in the index twitter.

// The Id of the document
$id = 1;

// Create a document
$tweet = array(
    'id'      => $id,
    'user'    => array(
        'name'      => 'mewantcookie',
        'fullName'  => 'Cookie Monster'
    ),
    'msg'     => 'Me wish there were expression for cookies like there is for apples. "A cookie a day make the doctor diagnose you with diabetes" not catchy.',
    'tstamp'  => '1238081389',
    'location'=> '41.12,-71.34'
);
// The first parameter is the id of the document, the second is its data.
$tweetDocument = new \Elastica\Document($id, $tweet);

// Add tweet to type
$elasticaType->addDocument($tweetDocument);

// Refresh Index
$elasticaType->getIndex()->refresh();

Now the index contains a document. But that’s not enough! Add more documents to the index, so a search makes sense!

Bulk indexing

Of course you can add one document after another. But if you want to index the content of a large database, this can be slow. It’s better to create an array of documents and add them all at once:

// Create holder for Elastica documents
$documents = array();
while ( ... ) { // Fetching content from the database
    $documents[] = new \Elastica\Document(
        $id, // Id of the current record
        array(
            ...
        )
    );
}
// Send all collected documents in one bulk request
$elasticaType->addDocuments($documents);
$elasticaType->getIndex()->refresh();

A good starting point is 500 documents per bulk operation. Depending on the size of your documents, you may have to experiment a little to find a good number for your application.
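
One way to keep each bulk operation at a fixed size is to split the source rows into batches first. The following is a minimal sketch in plain PHP, assuming the rows are already available as an array. The `$rows` data and the batch size variable are stand-ins for this illustration, and the Elastica call is only indicated in a comment so that the example runs without a cluster.

```php
<?php
// Stand-in data (an assumption): 1,200 rows as they might come from a database.
$rows = array();
for ($id = 1; $id <= 1200; $id++) {
    $rows[$id] = array('id' => $id, 'msg' => 'Tweet number ' . $id);
}

$batchSize = 500; // starting point from the text; tune it for your document size

// The third argument (preserve_keys) keeps the row ids as array keys in each batch
$batches = array_chunk($rows, $batchSize, true);

foreach ($batches as $batchNumber => $batch) {
    // With Elastica, each batch would be turned into \Elastica\Document objects
    // and passed to $elasticaType->addDocuments(...) as in the listing above.
    printf("Batch %d: %d documents\n", $batchNumber + 1, count($batch));
}
```

With 1,200 rows and a batch size of 500 this yields three batches; the last one holds the remaining 200 rows.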