Created at : 2025-04-18 07:52
Author: Soo.Y


๐Ÿ“๋ฉ”๋ชจ

Day 2 material

Introduction

Modern machine learning is built on data in many forms, such as images, text, and audio. This whitepaper introduces the power of embeddings, which unify such heterogeneous data into a single vector representation that can be used seamlessly across a wide range of applications.

์ด ๋ฐฑ์„œ์—์„œ ๋‹ค๋ฃจ๋Š” ์ฃผ์š” ๋‚ด์šฉ

  1. ์ž„๋ฒ ๋”ฉ ์ดํ•ดํ•˜๊ธฐ : ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ๋ฐ์ดํ„ฐ๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š” ๋ฐ ์™œ ์ž„๋ฒ ๋”ฉ์ด ์ค‘์š”ํ•œ์ง€, ๊ทธ๋ฆฌ๊ณ  ์ž„๋ฒ ๋”ฉ์ด ํ™œ์šฉ๋˜๋Š” ๋‹ค์–‘ํ•œ ์œผ์šฉ ์‚ฌ๋ก€๋ฅผ ์„ค๋ช…ํ•œ๋‹ค.
  2. ์ž„๋ฒ ๋”ฉ ๊ธฐ๋ฒ• : ์„œ๋กœ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ ์œ ํ˜•(์˜ˆ: ํ…์ŠคํŠธ, ์ด๋ฏธ์ง€ ์˜ค๋””์˜ค ๋“ฑ)์„ ๊ณตํ†ต๋œ ๋ฒกํ„ฐ ๊ณต๊ฐ„์œผ๋กœ ๋งคํ•‘ํ•˜๋Š” ๋‹ค์–‘ํ•œ ๊ธฐ๋ฒ•์„ ์†Œ๊ฐœํ•œ๋‹ค.
  3. ํšจ์œจ์ ์ธ ์ž…๋ฒ ๋”ฉ ๊ด€๋ฆฌ : ๋Œ€๊ทœ๋ชจ ์ž„๋ฒ ๋”ฉ ๋ฐ์ดํ„ฐ๋ฅผ ์ €์žฅ, ๊ฒ€์ƒ‰, ํƒ์ƒ‰ํ•˜๋Š” ๋ฐฉ๋ฒ•์— ๋Œ€๋‹ค๋ฃฌ๋‹ค.
  4. ๋ฒกํ„ฐ ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค(Vector Databases) : ์ž„๋ฒ ๋”ฉ์„ ํšจ์œจ์ ์œผ๋กœ ๊ด€๋ฆฌํ•˜๊ณ  ์ฟผ๋ฆฌํ•  ์ˆ˜ ์žˆ๋Š” ํŠนํ™”๋œ ์‹œ์Šคํ…œ๋“ค์„ ์†Œ๊ฐœํ•˜๋ฉฐ, ์‹ค์ œ ์šด์˜ ํ™˜๊ฒฝ์—์„œ์˜ ๊ณ ๋ ค์‚ฌํ•ญ๋„ ํ•จ๊ป˜ ์„ค๋ช…ํ•œ๋‹ค.
  5. ์‹ค์ œ ํ™œ์šฉ ์‚ฌ๋ก€ : ์ž„๋ฒ ๋”ฉ๊ณผ ๋ฒกํ„ฐ ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค๊ฐ€ ๋Œ€ํ˜• ์–ธ์–ด ๋ชจ๋ธ๊ณผ ๊ฒฐํ•ฉ๋˜์–ด ํ˜„์‹ค ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๋Š” ๊ตฌ์ฒด์ ์ธ ์˜ˆ์ œ๋ฅผ ์‚ดํŽด๋ณธ๋‹ค.

Throughout the whitepaper, code examples are provided so that the key concepts can be tried hands-on.

Why embeddings are important

์ž„๋ฒ ๋”ฉ์ด๋ž€, ํ…์ŠคํŠธ, ์Œ์„ฑ, ์ด๋ฏธ์ง€, ์˜์ƒ๊ณผ ๊ฐ™์€ ํ˜„์‹ค ์„ธ๊ณ„์˜ ๋ฐ์ดํ„ฐ๋ฅผ ์ˆ˜์น˜ํ™”ํ•œ ํ‘œํ˜„์ด๋‹ค. ์ž„๋ฒ ๋”ฉ์ด๋ผ๋Š” ์ด๋ฆ„์€ ํ•œ ๊ณต๊ฐ„์„ ๋‹ค๋ฅธ ๊ณต๊ฐ„์œผ๋กœ ๋งคํ•‘ํ•˜๊ฑฐ๋‚˜ ์‚ฝ์ž…ํ•˜๋Š” ์ˆ˜ํ•™์  ๊ฐœ๋…์—์„œ ์œ ๋ž˜๋˜์—ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด ์›๋ž˜ BERT ๋ชจ๋ธ์€ ํ…์ŠคํŠธ๋ฅผ 768๊ฐœ์˜ ์ˆซ์ž ๋ฒกํ„ฐ๋กœ ์ž„๋ฒ ๋”ฉํ•œ๋‹ค. ์ฆ‰, ๋ชจ๋“  ๋ฌธ์žฅ์˜ ๊ณ ์ฐจ์› ๊ณต๊ฐ„์—์„œ 768์ฐจ์›์˜ ์ €์ฐจ์› ๊ณต๊ฐ„์œผ๋กœ ๋งตํ•‘ํ•˜๋Š” ๊ฒƒ์ด๋‹ค.

์ž„๋ฒ ๋”ฉ์˜ ํ•ต์‹ฌ ๊ฐœ๋… ์ž„๋ฒ ๋”ฉ์€ ์ €์ฐจ์› ๋ฒกํ„ฐ๋กœ ํ‘œํ˜„๋˜๋ฉฐ, ๋‘ ๋ฒกํ„ฐ ๊ฐ„์˜ ๊ธฐํ•˜ํ•™์  ๊ฑฐ๋ฆฌ(์œ ํด๋ฆฌ๋“œ ๊ฑฐ๋ฆฌ ๋“ฑ)๋Š” ์‹ค์ œ ์„ธ๊ณ„ ๊ฐ์ฒด ๊ฐ„์˜ ์˜๋ฏธ์  ์œ ์‚ฌ์„ฑ์„ ๋ฐ˜์˜ํ•œ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด computer๋ผ๋Š” ๋‹จ์–ด๋Š” ์ปดํ“จํ„ฐ ์ด๋ฏธ์ง€์™€๋„ ์œ ์‚ฌํ•˜๊ณ  laptop๊ณผ๋„ ๋น„์Šทํ•ฎใ„ท๋””๋งŒ car์™€๋Š” ์œ ์‚ฌํ•˜์ง€ ์•Š๋‹ค. ์ฆ‰, ์„œ๋กœ ๋‹ค๋ฅธ ์œ ํ˜•์˜ ๋ฐ์ดํ„ฐ๋ฅผ ์ˆซ์ž ๊ณต๊ฐ„์—์„œ ๋น„๊ตํ•˜๊ณ  ๊ฒ€์ƒ‰ํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•ด์ฃผ๋Š” ๊ฒƒ์ด ์ž„๋ฒ ๋”ฉ์˜ ๊ฐ€์žฅ ํฐ ์žฅ์ ์ด๋‹ค.

์ž„๋ฒ ๋”ฉ์€ ์ผ์ข…์˜ ์ •๋ณด ์••์ถ•

  • ์ž„๋ฒ ๋”ฉ์€ ์›๋ณธ ๋ฐ์ดํ„ฐ๋ฅผ ์†์‹ค ์žˆ๋Š” ๋ฐฉ์‹์œผ๋กœ ์••์ถ•ํ•˜๋ฉด์„œ๋„ ์˜์ง€๋จน ํŠน์ง•์„ ์œ ์ง€ํ•œ๋‹ค.
  • ์ด๋Š” ๋Œ€๊ทœ๋ชจ ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ์™€ ์ €์žฅ์„ ํšจ์œจ์ ์œผ๋กœ ๋งŒ๋“ค๊ธฐ ์œ„ํ•œ ํ•ต์‹ฌ ๋„๊ตฌ์ด๋‹ค.

์ง๊ด€์ ์ธ ์˜ˆ์‹œ: ์œ„๋„์™€ ๊ฒฝ๋„

  • ์ง€๊ตฌ์ƒ์˜ ์œ„์น˜๋ฅผ ๋‘ ์ˆซ์ž๋กœ ํ‘œํ˜„ํ•˜๋Š” ๊ฒƒ๋„ ํ•˜๋‚˜์˜ ์ž„๋ฒ ๋”ฉ์ด๋ผ๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋‹ค.
  • ๋‘ ์œ„์น˜์˜ ์œ„๊ฒฝ๋„๋ฅผ ๋น„๊ตํ•˜๋ฉด ์„œ๋ฆฌ, ์ธ์ ‘ ์œ„์น˜, ์œ ์‚ฌ์„ฑ ๋“ฑ์„ ๊ณ„์‚ฐํ•  ์ˆ˜ ์žˆ๋‹ค.
  • ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ ํ…์ŠคํŠธ ์ž„๋ฒ ๋”ฉ์—์„œ๋„ ๋ฒกํ„ฐ ๊ณต๊ฐ„์—์„œ ๊ฐ€๊นŒ์šด ์œ„์น˜์— ์žˆ๋Š” ํ…์ŠคํŠธ๋Š” ์˜๋ฏธ์ ์œผ๋กœ ์œ ์‚ฌํ•œ ํ…์ŠคํŠธ์ž„์„ ๋‚˜ํƒ€๋‚ธ๋‹ค.

๊ฒ€์ƒ‰๊ณผ ์ถ”์ฒœ์—์„œ์˜ ์ค‘์š”์„ฑ RAG, ๊ฒ€์ƒ‰, ์ถ”์ฒœ ์‹œ์Šคํ…œ, ๊ด‘๊ณ , ์ด์ƒ ํƒ์ง€ ๋“ฑ ๋‹ค์–‘ํ•œ ์‹ค๋ฌด ์˜์—ญ์—์„œ ์ž„๋ฒ ๋”ฉ์€ ํ•ต์‹ฌ ์š”์†Œ์ด๋‹ค. ๋ฒกํ„ฐ ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค๋ฅผ ํ†ตํ•ด, ๊ฑฐ๋Œ€ํ•œ ๋ฐ์ดํ„ฐ์…‹ ๋‚ด์—์„œ ๋น ๋ฅด๊ฒŒ ์œ ์‚ฌ ํ•ญ๋ชฉ์„ ์ฐพ์•„๋‚ผ ์ˆ˜ ์žˆ๋Š” ๊ฒƒ์ด ์˜ค๋Š˜๋‚ ์˜ ์‹ค์‹œ๊ฐ„ ์‹œ์Šคํ…œ ๊ตฌํ˜„์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•œ๋‹ค.

  • ๋‹จ, ์œ„๋„/๊ฒฝ๋„๋Š” ์ง€๊ตฌ์˜ ๊ตฌํ˜• ๊ตฌ์กฐ์— ๋”ฐ๋ผ ์„ค๊ณ„๋œ ๊ฒƒ์ด์ง€๋งŒ, ํ…์ŠคํŠธ ์ž„๋ฒ ๋”ฉ ๊ณต๊ฐ„์€ ์‹ ๊ฒฝ๋ง์ด ํ•™์Šต์„ ํ†ตํ•ด ์ž๋™์œผ๋กœ ๋งŒ๋“ค์–ด๋‚ธ ๊ณต๊ฐ„์ด๋‹ค.

Important! Embeddings produced by different models cannot be compared directly, so it is essential to use embedding versions that remain compatible and consistent with one another.

The three steps of an embedding-based search system

  1. Pre-compute embeddings for the (potentially billions of) items in the search space
  2. Map the query into the same embedding space
  3. Find the nearest neighbors of the query embedding and return them quickly (a brute-force sketch follows below)
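The sketch below walks through these three steps with brute-force nearest-neighbor search in NumPy. Here embed() is a stand-in for a real embedding model (it returns pseudo-random placeholder vectors), so the scores themselves are meaningless; the point is the three-step structure.

import numpy as np

def embed(text: str) -> np.ndarray:
    # Stand-in for a real embedding model: a pseudo-random 768-dim vector per text
    # (stable within one process; a real system would call an embedding model here).
    seed = abs(hash(text)) % (2 ** 32)
    return np.random.default_rng(seed).normal(size=768)

# 1. Pre-embed every item in the corpus.
corpus = ["red running shoes", "wireless headphones", "breathable cotton shirt"]
corpus_vecs = np.stack([embed(t) for t in corpus])

# 2. Map the query into the same embedding space.
query_vec = embed("lightweight summer clothing")

# 3. Brute-force nearest neighbors by cosine similarity
#    (production systems use ANN indexes such as ScaNN or HNSW instead).
sims = corpus_vecs @ query_vec / (
    np.linalg.norm(corpus_vecs, axis=1) * np.linalg.norm(query_vec)
)
for idx in np.argsort(-sims):
    print(f"{sims[idx]:.3f}  {corpus[idx]}")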

๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ๋ฐ์ดํ„ฐ์—๋„ ์ ํ•ฉ ํ˜„๋Œ€ ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์€ ํ…์ŠคํŠธ, ์Œ์„ฑ, ์ด๋ฏธ์ง€, ์˜์ƒ ๋“ฑ ๋‹ค์–‘ํ•œ ๋ฐ์ดํ„ฐ ์œ ํ˜•์„ ํ•จ๊ป˜ ๋‹ค๋ฃจ๋ฉฐ, ์ด๋Ÿฐ ํ™˜๊ฒฝ์—์„œ๋Š” ๊ณตํ†ต ์ž„๋ฒ ๋”ฉ ๊ณต๊ฐ„์ด ํŠนํžˆ ์œ ์šฉํ•ฉ๋‹ˆ๋‹ค.

์ž„๋ฒ ๋”ฉ์˜ ํ•ต์‹ฌ ํŠน์ง• ์š”์•ฝ

  • ์˜๋ฏธ์ ์œผ๋กœ ์œ ์‚ฌํ•œ ๊ฐ์ฒด๋ฅผ ๊ฐ€๊นŒ์šด ์œ„์น˜์— ๋งคํ•‘
  • ๋‹ค์šด์ŠคํŠธ๋ฆผ ์ž‘์—…์—์„œ ์œ ์šฉํ•œ ์š”์•ฝ๋œ ์˜๋ฏธ ํ‘œํ˜„์œผ๋กœ ์‚ฌ์šฉ ๊ฐ€๋Šฅ
  • ML ๋ชจ๋ธ ์ž…๋ ฅ, ์ถ”์ฒœ ์‹œ์Šคํ…œ, ๊ฒ€์ƒ‰ ์—”์ง„, ๋ถ„๋ฅ˜ ๋ชจ๋ธ ๋“ฑ์—์„œ ํ™œ์šฉ ๊ฐ€๋Šฅ
  • ์ž„๋ฒ ๋”ฉ์€ ์ž‘์—…์— ๋”ฐ๋ผ ์ตœ์ ํ™”๋œ ํ‘œํ˜„์„ ๊ฐ€์งˆ ์ˆ˜ ์žˆ์Œ โ†’ ๊ฐ™์€ ๊ฐ์ฒด๋ผ๋„ ๋ชฉ์ ์— ๋”ฐ๋ผ ๋‹ค๋ฅธ ์ž„๋ฒ ๋”ฉ ์ƒ์„ฑ ๊ฐ€๋Šฅ

Evaluating Embedding Quality

์ž„๋ฒ ๋”ฉ ๋ชจ๋ธ์˜ ํ’ˆ์งˆ์€ ์‚ฌ์šฉํ•˜๋Š” ์ž‘์—…์— ๋”ฐ๋ผ ์„œ๋กœ ๋‹ค๋ฅธ ๋ฐฉ์‹์œผ๋กœ ํ‰๊ฐ€๋œ๋‹ค. ๋Œ€๋ถ€๋ถ„์˜ ์ผ๋ฐ˜์ ์ธ ํ‰๊ฐ€ ์ง€ํ‘œ๋Š”, ๋น„์Šทํ•œ ํ•ญ๋ชฉ์€ ์ž˜ ์ฐพ์•„๋‚ด๊ณ , ๊ด€๋ จ ์—†๋Š” ํ•ญ๋ชฉ์€ ์ œ์™ธํ•˜๋Š” ๋Šฅ๋ ฅ์— ์ดˆ์ ์„ ๋‘”๋‹ค. ์ด๋Ÿฌํ•œ ํ‰๊ฐ€๋Š” ๋ณดํ†ต ์ •๋‹ต์ด ๋ ˆ์ด๋ธ”๋ง๋œ ๋ฐ์ดํ„ฐ์…‹์ด ํ•„์š”ํ•˜๋ฉฐ, ์˜ˆ๋ฅผ ๋“ค์–ด Snippet 0์—์„œ๋Š” NFCorpus ๋ฐ์ดํ„ฐ์…‹์„ ์‚ฌ์šฉํ•˜์—ฌ ๋‹ค์–‘ํ•œ ์ง€ํ‘œ๋ฅผ ์„ค๋ช…ํ•œ๋‹ค.

๊ฒ€์ƒ‰ ๊ณผ์ œ์—์„œ์˜ ์ฃผ์š” ํ‰๊ฐ€ ์ง€ํ‘œ

  1. ์ •ํ™•๋„(Precision) : ๊ฒ€์ƒ‰๋œ ๋ฌธ์„œ ์ค‘์—์„œ ์–ผ๋งˆ๋‚˜ ๋งŽ์€ ๋ฌธ์„œ๊ฐ€ ์‹ค์ œ๋กœ ๊ด€๋ จ์žˆ๋Š” ๋ฌธ์„œ์ธ์ง€ ํ‰๊ฐ€
    • ์˜ˆ: 10๊ฐœ ๋ฌธ์„œ ์ค‘ 7๊ฐœ๊ฐ€ ๊ด€๋ จ ์žˆ๋‹ค๋ฉด โ†’ precision@10 = 7/10 = 0.7
  2. ์žฌํ˜„์œจ(Recall) : ์ „์ฒด ๊ด€๋ จ ๋ฌธ์„œ ์ค‘์—์„œ ์–ผ๋งˆ๋‚˜ ๋งŽ์ด ๊ฒ€์ƒ‰๋˜์—ˆ๋Š”์ง€ ํ‰๊ฐ€
    • ์˜ˆ: ๊ด€๋ จ ๋ฌธ์„œ๊ฐ€ ์ด 6๊ฐœ์ธ๋ฐ ๊ทธ ์ค‘ 3๊ฐœ๊ฐ€ ๊ฒ€์ƒ‰๋˜์—ˆ๋‹ค๋ฉด โ†’ recall@K = 3/6 = 0.5

์ด์ƒ์ ์ธ ์ž„๋ฒ ๋”ฉ ๋ชจ๋ธ์ด๋ผ๋ฉด

  • ๋ชจ๋“  ๊ด€๋ จ ๋ฌธ์„œ๋Š” ๋น ์ง์—†์ด ๊ฒ€์ƒ‰ํ•˜๊ณ 
  • ๊ด€๋ จ ์—†๋Š” ๋ฌธ์„œ๋Š” ํ•˜๋‚˜๋„ ํฌํ•จ๋˜์ง€ ์•Š์•„์•ผ ํ•œ๋‹ค. ํ•˜์ง€๋งŒ ํ˜„์‹ค์—์„œ๋Š” ์ผ๋ถ€ ๊ด€๋ จ ๋ฌธ์„œ๋ฅผ ๋†“์น˜๊ธฐ๋„ ํ•˜๊ณ , ๊ด€๋ จ ์—†๋Š” ๋ฌธ์„œ๊ฐ€ ํฌํ•จ๋˜๊ธฐ๋„ ํ•˜๋ฏ€๋กœ ์ •๋Ÿ‰์ ์ธ ๊ธฐ์ค€์ด ํ•„์š”ํ•˜๋‹ค.

Evaluation when ranking information is available

  • Precision/Recall์€ ๊ด€๋ จ์„ฑ ์—ฌ๋ถ€๊ฐ€ ์ด์ง„(Binary)์ผ ๋•Œ ์œ ์šฉํ•˜์ง€๋งŒ, ์‹ค์ œ ๊ฒ€์ƒ‰ ํ™˜๊ฒฝ์—์„œ๋Š” ๋” ๊ด€๋ จ์„ฑ ๋†’์€ ๋ฌธ์„œ๊ฐ€ ์œ„์— ๋‚˜์˜ค๋Š” ๊ฒƒ์ด ๋งค์šฐ ์ค‘์š”ํ•˜๋‹ค.
  • ์ด๋Ÿด ๋•Œ ์‚ฌ์šฉํ•˜๋Š” ๋Œ€ํ‘œ์ ์ธ ์ง€ํ‘œ๊ฐ€ ๋ฐ”๋กœ โ€œ์ •๊ทœํ™” ํ• ์ธ ๋ˆ„์  ์ด๋“โ€(nDCG)์ด๋‹ค.

nDCG(Normalized Discounted Cumulative Gain)

  • ๊ฐ ๋ฌธ์„œ์˜ ๊ด€๋ จ์„ฑ ์ ์ˆ˜(reli)๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ๊ณ„์‚ฐ
  • ํ•˜๋‹จ์— ์œ„์น˜ํ•œ ๋ฌธ์„œ๋Š” ํŒจ๋„ํ‹ฐ๋ฅผ ๋ถ€์—ฌ
  • ์ด์ƒ์ ์ธ ์ˆœ์„œ๋ฅผ ๊ธฐ์ค€์œผ๋กœ ์ •๊ทœํ™”ํ•˜์—ฌ 0.0 ~ 1.0 ์‚ฌ์ด์˜ ๊ฐ’์œผ๋กœ ํ‘œํ˜„ํ•˜์—ฌ ์„œ๋กœ ๋‹ค๋ฅธ ์ฟผ๋ฆฌ๋‚˜ ์‹œ์Šคํ…œ ๊ฐ„ ๊ณต์ •ํ•œ ๋น„๊ต ๊ฐ€๋Šฅ

์ˆ˜์‹ ์ •๋ฆฌ

โ‘  DCG@p (์ •๋ ฌ๋œ ๊ฒฐ๊ณผ์˜ ํ• ์ธ ๋ˆ„์  ์ด๋“)

  • โ€‹: ์ˆœ์œ„ iii์— ์žˆ๋Š” ๋ฌธ์„œ์˜ ๊ด€๋ จ์„ฑ ์ ์ˆ˜

  • : ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ์˜ ์ˆœ์œ„ (1๋ถ€ํ„ฐ ์‹œ์ž‘)

  • : ํ‰๊ฐ€ํ•  ์ƒ์œ„ ๋ฌธ์„œ ์ˆ˜ (ex: @10์ด๋ฉด ์ƒ์œ„ 10๊ฐœ ๋ฌธ์„œ๋งŒ ์‚ฌ์šฉ)


โ‘ก IDCG@p (์ด์ƒ์ ์ธ DCG)

  • โ€‹: ๊ด€๋ จ์„ฑ ์ ์ˆ˜๋ฅผ ๋‚ด๋ฆผ์ฐจ์ˆœ ์ •๋ ฌํ•œ ์ด์ƒ์  ์ˆœ์œ„

โ‘ข nDCG@p (์ •๊ทœํ™”๋œ ์ ์ˆ˜)

  • ๊ฐ’ ๋ฒ”์œ„: 0.0 ~ 1.0

  • 1์— ๊ฐ€๊นŒ์šธ์ˆ˜๋ก โ†’ ์ด์ƒ์ ์ธ ์ˆœ์œ„์— ๊ฐ€๊นŒ์›€

import numpy as np
 
def dcg_at_k(relevance_scores, k):
    """
    relevance_scores: list of relevance scores (e.g., [3, 2, 3, 0, 1])
    k: how many top results to evaluate (e.g., k=5)
    """
    relevance_scores = np.asarray(relevance_scores, dtype=float)[:k]
    if relevance_scores.size == 0:
        return 0.0
    return np.sum((2 ** relevance_scores - 1) / np.log2(np.arange(2, relevance_scores.size + 2)))
 
def ndcg_at_k(predicted_relevance, ideal_relevance, k):
    """
    predicted_relevance: relevance scores in the order predicted by the model
    ideal_relevance: relevance scores in the ideal order (descending)
    """
    dcg = dcg_at_k(predicted_relevance, k)
    idcg = dcg_at_k(sorted(ideal_relevance, reverse=True), k)
    return dcg / idcg if idcg != 0 else 0.0
 
# Relevance scores of the predicted search results (in model output order)
pred = [3, 2, 0, 1, 0]
 
# Ideal relevance scores of the ground-truth documents for this query
ideal = [3, 3, 2, 1, 0]
 
# k to evaluate (e.g., the top 5 documents)
k = 5
 
print(f"nDCG@{k}:", round(ndcg_at_k(pred, ideal, k), 4))
 

๊ณต๊ฐœ ๋ฒค์น˜๋งˆํฌ

  • BEIR: ๊ฒ€์ƒ‰/์งˆ๋ฌธ์‘๋‹ต ๋“ฑ ๋‹ค์–‘ํ•œ ์ž‘์—…์„ ํ‰๊ฐ€ํ•˜๋Š” ๋Œ€ํ‘œ ๋ฒค์น˜๋งˆํฌ

  • MTEB (Massive Text Embedding Benchmark): ๋Œ€๊ทœ๋ชจ ์ž„๋ฒ ๋”ฉ ํ’ˆ์งˆ ๋น„๊ต๋ฅผ ์œ„ํ•œ ๋ฒค์น˜๋งˆํฌ

  • **TREC (Text REtrieval Conference)**์—์„œ ์ œ์ž‘ํ•œ
    trec_eval์ด๋‚˜ Python ๋ž˜ํผ์ธ pytrec_eval์„ ํ†ตํ•ด
    Precision, Recall, nDCG ๋“ฑ ๋‹ค์–‘ํ•œ ์ง€ํ‘œ๋ฅผ ์ผ๊ด€๋˜๊ฒŒ ๊ณ„์‚ฐ ๊ฐ€๋Šฅ
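A minimal usage sketch with pytrec_eval, assuming it is installed (pip install pytrec_eval); the query and document IDs below are made up:

import pytrec_eval

# Ground-truth relevance judgments (qrels): query -> {doc_id: graded relevance}
qrel = {"q1": {"d1": 2, "d2": 0, "d3": 1}}

# System output (run): query -> {doc_id: retrieval score}
run = {"q1": {"d1": 1.2, "d2": 0.9, "d3": 0.3}}

evaluator = pytrec_eval.RelevanceEvaluator(qrel, {"map", "ndcg"})
print(evaluator.evaluate(run))  # per-query MAP and nDCG values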

์‘์šฉ ํ™˜๊ฒฝ์—์„œ์˜ ์ตœ์  ํ‰๊ฐ€

  • ์–ด๋–ค ์ž„๋ฒ ๋”ฉ ๋ชจ๋ธ์ด โ€œ์ตœ์ โ€์ธ์ง€๋Š” ์ ์šฉ ๋ถ„์•ผ์— ๋”ฐ๋ผ ๋‹ฌ๋ผ์งˆ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
    ํ•˜์ง€๋งŒ ๋Œ€๋ถ€๋ถ„์˜ ๊ฒฝ์šฐ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ง๊ด€์ด ์ข‹์€ ์ถœ๋ฐœ์ ์ด ๋ฉ๋‹ˆ๋‹ค:

"Objects that are similar should also be located close to each other in the embedding space."

Other factors to consider

  • ๋ชจ๋ธ ํฌ๊ธฐ (Model Size)
  • ์ž„๋ฒ ๋”ฉ ์ฐจ์› ์ˆ˜ (Embedding Dimension Size)
  • ์‘๋‹ต ์ง€์—ฐ ์‹œ๊ฐ„ (Latency)
  • ์ „์ฒด ์‹œ์Šคํ…œ ๋น„์šฉ (Total Cost)

์‹ค์ œ ์ œํ’ˆ ํ™˜๊ฒฝ์—์„œ ์ž„๋ฒ ๋”ฉ ๋ชจ๋ธ์„ ์„ ํƒํ•  ๋•Œ ๋งค์šฐ ์ค‘์š”ํ•œ ์š”์†Œ์ž…๋‹ˆ๋‹ค.

Search Example

Types of embeddings

Text embeddings

Word embeddings

Document embeddings

Shallow BoW models

Deeper pretrained large language models

Images & multimodal embeddings

Structured data embeddings

General structured data

User/item structured data

Graph embeddings

Training Embeddings

Vector search

Important vector search algorithms

Locality sensitive hashing & trees

Hierarchical navigable small worlds

ScaNN

Vector databases

Operational considerations

Applications

Q&A with sources(retrieval augmented generation)

RAG (Retrieval-Augmented Generation) is a Q&A approach that combines the strengths of information retrieval and text generation.

์–ด๋–ป๊ฒŒ ์ž‘๋™ํ•˜๋‚˜์š”?

  1. ์ง€์‹ ๋ฒ ์ด์Šค์—์„œ ๊ด€๋ จ ๋ฌธ์„œ๋ฅผ ๋จผ์ € ๊ฒ€์ƒ‰ํ•œ๋‹ค.
  2. ๊ฒ€์ƒ‰๋œ ์ •๋ณด๋ฅผ ํ”„๋กฌํ”„ํŠธ์— ์ถ”๊ฐ€ํ•œ๋‹ค.
  3. ํ™•์žฅ๋œ ํ”„๋กฌํ”„ํŠธ๋ฅผ ๋ฐ”ํƒ•์œผ๋กœ LLM์ด ์‘๋‹ต์„ ์ƒ์„ฑํ•œ๋‹ค.

Prompt Expansion์ด๋ž€?

  • ํ”„๋กฌํ”„ํŠธ ํ™•์žฅ(Prompt Expansion)์€ ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ(์ฃผ๋กœ ์˜๋ฏธ ๊ธฐ๋ฐ˜ ๊ฒ€์ƒ‰ + ๋น„์ง€๋‹ˆ์Šค ๊ทœ์น™)์„ ์›๋ž˜ ํ”„๋กฌํ”„ํŠธ์— ๋ง๋ถ™์ด๋Š” ๊ธฐ์ˆ ์ด๋‹ค.

Which problems does RAG address? It is useful for mitigating two of the most common problems with LLMs.

  1. ํ™˜๊ฐ(hallucination)
    • LLM์ด ์‚ฌ์‹ค์ด ์•„๋‹Œ ๋‚ด์šฉ์„ ๊ทธ๋Ÿด๋“ฏํ•˜๊ฒŒ ๋งŒ๋“ค์–ด๋‚ด๋Š” ํ˜„์ƒ
    • RAG๋Š” ์‹ ๋ขฐํ•  ์ˆ˜ ์žˆ๋Š” ๋ฌธ์„œ ๊ธฐ๋ฐ˜์œผ๋กœ ๋‹ต๋ณ€์„ ์ƒ์„ฑํ•˜์—ฌ ์ด๋ฅผ ์ค„์—ฌ์ค€๋‹ค.
  2. ์ž์ฃผ ์žฌํ•™์Šตํ•ด์•ผ ํ•˜๋Š” ๋ฌธ์ œ
    • ์ตœ์‹  ์ •๋ณด๋ฅผ ๋ฐ˜์˜ํ•˜๋ ค๋ฉด ๋ชจ๋ธ์„ ์žฌํ›ˆ๋ จํ•ด์•ผ ํ•˜๋Š” ๋น„์šฉ์ด ํฌ๋‹ค.
    • ํ•˜์ง€๋งŒ RAG๋Š” ์ตœ์‹  ์ •๋ณด๋ฅผ ํ”„๋กฌํ”„ํŠธ๋กœ ์ง์ ‘ ์ „๋‹ฌํ•  ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์—, ์žฌํ›ˆ๋ จ ์—†์–ด๋„ ์ตœ์‹  ์‘๋‹ต ์ƒ์„ฑ ๊ฐ€๋Šฅ

๋‹จ, RAG๊ฐ€ ํ™˜๊ฐ์„ ์™„์ „ํžˆ ์ œ๊ฑฐํ•˜์ง€ ์•Š๋Š”๋‹ค.

Remedy: return the sources and check for accuracy

  • ํ™˜๊ฐ์„ ๋” ์ค„์ด๊ธฐ ์œ„ํ•ด์„œ๋Š” ๊ฒ€์ƒ‰๋œ ์ถœ์ฒ˜๋ฅผ ํ•จ๊ป˜ ๋ฐ˜ํ™˜ํ•˜๊ณ  ์‚ฌ๋žŒ์ด๋‚˜ LLM์ด ํ•ด๋‹น ์ถœ์ฒ˜์™€ ์‘๋‹ต์˜ ์ผ๊ด€์„ฑ์„ ํ™•์ธํ•˜๋Š” ๊ฒƒ์ด ์ข‹๋‹ค.
  • ์ด๋ ‡๊ฒŒ ํ•˜๋ฉด LLM์ด ์ƒ์„ฑํ•œ ์‘๋‹ต์ด ์‹ค์ œ๋กœ ์˜๋ฏธ์ ์œผ๋กœ ์œ ์‚ฌํ•œ ์ถœ์ฒ˜ ์ •๋ณด์™€ ์ผ์น˜ํ•˜๋Š”์ง€ ๊ฒ€์ฆํ•  ์ˆ˜ ์žˆ๋‹ค.
# Before you start run this command:
# pip install --upgrade --user --quiet google-cloud-aiplatform langchain_google_vertexai
# after running pip install make sure you restart your kernel
 
# TODO : Set values as per your requirements
# Project and Storage Constants
PROJECT_ID = "<my_project_id>"
REGION = "<my_region>"
BUCKET = "<my_gcs_bucket>"
BUCKET_URI = f"gs://{BUCKET}"
 
# The number of dimensions for text-embedding-005 is 768
# If another embedder is used, the dimensions would probably need to change.
DIMENSIONS = 768
 
# Index Constants
DISPLAY_NAME = "<my_matching_engine_index_id>"
DEPLOYED_INDEX_ID = "yourname01" # you set this. Start with a letter.
from google.cloud import aiplatform
from langchain_google_vertexai import VertexAIEmbeddings
from langchain_google_vertexai import VertexAI
from langchain_google_vertexai import (
    VectorSearchVectorStore,
    VectorSearchVectorStoreDatastore,
)
from langchain.chains import RetrievalQA
from langchain.prompts.chat import (
    ChatPromptTemplate,
    SystemMessagePromptTemplate,
    HumanMessagePromptTemplate,
)
from IPython.display import display, Markdown
 
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)
embedding_model = VertexAIEmbeddings(model_name="text-embedding-005")
 
# NOTE : This operation can take up to 30 seconds
my_index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name=DISPLAY_NAME,
    dimensions=DIMENSIONS,
    approximate_neighbors_count=150,
    distance_measure_type="DOT_PRODUCT_DISTANCE",
    index_update_method="STREAM_UPDATE", # allowed values BATCH_UPDATE , STREAM_UPDATE
)
 
# Create an endpoint
my_index_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
    display_name=f"{DISPLAY_NAME}-endpoint", public_endpoint_enabled=True
)
 
# NOTE : This operation can take up to 20 minutes
my_index_endpoint = my_index_endpoint.deploy_index(
    index=my_index, deployed_index_id=DEPLOYED_INDEX_ID
)
 
my_index_endpoint.deployed_indexes
 
# TODO : replace 1234567890123456789 with your actual index ID
my_index = aiplatform.MatchingEngineIndex("1234567890123456789")
 
# TODO : replace 1234567890123456789 with your actual endpoint ID
# Be aware that the Index ID differs from the endpoint ID
my_index_endpoint = aiplatform.MatchingEngineIndexEndpoint("1234567890123456789")
from langchain_google_vertexai import (
    VectorSearchVectorStore,
    VectorSearchVectorStoreDatastore,
)
 
# Input texts
texts = [
"The cat sat on",
"the mat.",
"I like to",
"eat pizza for",
"dinner.",
"The sun sets",
"in the west.",
]
 
# Create a Vector Store
vector_store = VectorSearchVectorStore.from_components(
    project_id=PROJECT_ID,
    region=REGION,
    gcs_bucket_name=BUCKET,
    index_id=my_index.name,
    endpoint_id=my_index_endpoint.name,
    embedding=embedding_model,
    stream_update=True,
)
 
# Add vectors and mapped text chunks to your vector store
vector_store.add_texts(texts=texts)
 
# Initialize the vector_store as a retriever
retriever = vector_store.as_retriever()
 
# perform simple similarity search on retriever
retriever.invoke("What are my options in breathable fabric?")
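The imports above bring in RetrievalQA, VertexAI, and the prompt template classes, but the snippet stops at the bare retriever. Below is a minimal sketch, assuming the index and endpoint above are already deployed, of how the retriever could be wired into a question-answering chain that also returns its sources; the model name is a placeholder, not something prescribed by the whitepaper.

# Sketch only: combine the retriever with an LLM so answers come back with sources.
llm = VertexAI(model_name="gemini-2.0-flash")  # placeholder model name

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",            # stuff the retrieved chunks into the prompt
    retriever=retriever,
    return_source_documents=True,  # return sources so the answer can be checked
)

result = qa_chain.invoke({"query": "What are my options in breathable fabric?"})
display(Markdown(result["result"]))
for doc in result["source_documents"]:
    print(doc.page_content)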

📜Sources (references)


🔗Linked notes