Pikurate ES: How It Works, Part 1 (2021.12.12~2021.12.15)

<aside> ⭐ Goal: understand the django-elasticsearch-dsl package and how it works inside Pikurate

</aside>

Django Settings

Indexing is run with ./manage.py search_index --rebuild.

Index

Naming the index

Naming the index with a decorator

from django_elasticsearch_dsl import Document, Index
from django_elasticsearch_dsl.registry import registry

# The name of your index
car = Index('cars')
# See Elasticsearch Indices API reference for available settings
car.settings(
    number_of_shards=1,
    number_of_replicas=0
)

@registry.register_document
@car.document
class CarDocument(Document):
    class Django:
        ...

Configuring the index with an inner class

@registry.register_document
class ManufacturerDocument(Document):
    class Index:
        # The index name must be a string
        name = 'manufacture'
        settings = {'number_of_shards': 1,
                    'number_of_replicas': 0}

    class Django:
        ...

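Faster indexing with the Meta class

According to the official documentation:

If your model have huge amount of data, its preferred to use parallel indexing. To do that, you can pass --parallel flag while reindexing or populating. - Django ES DSL -

In other words, the package supports parallel indexing for cases where there is a large amount of data to index. To set this up, define Meta as an inner class and declare the parallel-related options there; the quoted passage also describes passing --parallel to the command when reindexing or populating.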

Fields

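Fields specify what gets stored when data is indexed into the ES database. When defining a document class, which corresponds to a single row stored in Elasticsearch, you can specify which fields are stored and in what form. Typically, the fields to include are listed in an inner class named Django.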

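from django.db import models
from django_elasticsearch_dsl import Document
from django_elasticsearch_dsl.registry import registry

class Car(models.Model):
    name = models.CharField(max_length=30)
    color = models.CharField(max_length=10)
    description = models.TextField()
    type = models.IntegerField()

@registry.register_document
class CarDocument(Document):
    class Django:
        model = Car
        fields = [
            'name',
            'color',
            'description',
            'type',
        ]

This is the most common approach. However, if type should be stored in the ES document as a string rather than an integer, define a method on the model and reference it from the document.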

from django_elasticsearch_dsl import fields

class Car(models.Model):
    ...
    def type_to_string(self):
        if self.type == 1:
            return "Sedan"
        elif self.type == 2:
            return "Truck"
        else:
            return "SUV"

@registry.register_document
class CarDocument(Document):
    # Populated from Car.type_to_string() instead of the raw integer column
    type = fields.TextField(attr="type_to_string")

    class Django:
        model = Car
        fields = [
            'name',
            'color',
            'description',
        ]

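As shown above, the type_to_string method is assigned to a custom field declared directly on the document class. The custom field must then be removed from the fields list in the inner Django class.

When attr is given, pass the name of the method the field should read from. The method defined on the model is executed, and its return value is stored in the field.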
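To make the attr lookup concrete, here is a minimal pure-Python sketch. It is not the library's implementation: Car is a plain class standing in for the Django model, and resolve_attr is a hypothetical helper that only illustrates the idea of "look up the name on the instance and call it if it is a method".

```python
class Car:
    """Plain-Python stand-in for the Django model in the notes (no database needed)."""

    def __init__(self, name, type):
        self.name = name
        self.type = type

    def type_to_string(self):
        # Same mapping as the model method in the notes.
        if self.type == 1:
            return "Sedan"
        elif self.type == 2:
            return "Truck"
        return "SUV"


def resolve_attr(instance, attr):
    """Hypothetical helper: fetch `attr` from the instance, calling it if it is callable."""
    value = getattr(instance, attr)
    return value() if callable(value) else value


car = Car(name="Model S", type=1)
indexed_type = resolve_attr(car, "type_to_string")  # method, so it is called
indexed_name = resolve_attr(car, "name")            # plain attribute, returned as-is
print(indexed_type, indexed_name)
```

With this picture in mind, the indexed document holds the string returned by type_to_string rather than the integer stored in the database column.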

Analyzer

Analyzers can be defined with the analyzer helper from elasticsearch_dsl. Using the package, an analyzer is defined as follows.

→ https://elasticsearch-dsl.readthedocs.io/en/stable/persistence.html#analysis

from elasticsearch_dsl import analyzer, tokenizer

my_analyzer = analyzer('<name>',
    tokenizer=tokenizer('trigram', 'nGram', min_gram=3, max_gram=3),
    filter=['lowercase']
)
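As a rough illustration of what a trigram tokenizer followed by a lowercase filter produces, the sketch below reimplements the idea in plain Python. It is an approximation written for these notes, not Elasticsearch code: trigram_tokens is a hypothetical helper, and it ignores tokenizer options such as token_chars.

```python
def trigram_tokens(text, min_gram=3, max_gram=3):
    """Slide a window over the text, emit every n-gram, then lowercase it."""
    tokens = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            end = start + size
            if end <= len(text):
                # Lowercasing mimics the 'lowercase' token filter step.
                tokens.append(text[start:end].lower())
    return tokens


print(trigram_tokens("Hello"))  # ['hel', 'ell', 'llo']
```

Because every three-character window becomes a token, a query fragment such as "ell" can match "Hello" even though it is not a prefix.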

To inspect the tokens an analyzer produces, use the simulate method, which the DSL provides on top of the analyze API. Let's run token analysis with the analyzers defined in pikurate.apps.search_indexes.documents.analyzers.

This is the analyzer used for search. It lowercases the text and applies edge n-grams.

This is the result of analyzing with simulate.
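The lowercase plus edge n-gram behaviour can be approximated in plain Python as follows. This is a sketch under assumptions: the min_gram and max_gram values used in Pikurate are not shown in these notes, so the numbers below are illustrative, and edge_ngram_tokens is a hypothetical helper that splits on whitespace before taking prefixes.

```python
def edge_ngram_tokens(text, min_gram=1, max_gram=10):
    """Lowercase the text, split on whitespace, and emit prefixes of each word."""
    tokens = []
    for word in text.lower().split():
        longest = min(max_gram, len(word))
        for size in range(min_gram, longest + 1):
            # Edge n-grams are always anchored at the start of the word.
            tokens.append(word[:size])
    return tokens


print(edge_ngram_tokens("Pikurate ES", min_gram=2, max_gram=4))
# ['pi', 'pik', 'piku', 'es']
```

Since only prefixes are indexed, this style of analyzer suits search-as-you-type matching, where the user has typed the beginning of a word.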

html_strip: provides an analysis step that automatically removes HTML tags.

Only hello is returned, with the br tag removed.
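The effect described above can be imitated with the standard library. This is only a stand-in for the idea of dropping tags before tokenizing, not the Elasticsearch html_strip implementation, whose handling of entities and block-level tags may differ; strip_html and _TextOnly are hypothetical names.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects text nodes and ignores every tag."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(text):
    """Return the text content of an HTML fragment with all tags dropped."""
    parser = _TextOnly()
    parser.feed(text)
    parser.close()  # flush any buffered text
    return "".join(parser.chunks)


print(strip_html("<br>hello"))  # hello
```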