
ListWords API - Filtering and Querying

Overview

The ListWords API filters results with a single CEL (Common Expression Language) expression, the same filtering approach used by all List APIs.

API Definition

message ListWordsRequest {
  string filter = 1;    // CEL expression for filtering
  string order_by = 2;  // Sorting specification
  common.v1.PaginationRequest pagination = 3;
}

Filter Fields

The following fields are available for filtering:

| Field | Type | Operators | Description |
| --- | --- | --- | --- |
| language | string | ==, != | ISO 639-1 language code (e.g., "en", "fr") |
| keyword | string | ==, != | Searches within lemma AND inflected forms (partial match, case-insensitive) |
| category | string[] | in | Filters by word categories/tags (OR logic) |
| surface | string[] | in | Batch lookup by exact lemma OR inflected forms (exact match) |

Filter Examples

Language Filtering

Find all English words:

filter: 'language == "en"'

Find all French words:

filter: 'language == "fr"'

Keyword Search

The keyword field searches both the lemma and all inflected forms with a case-insensitive partial match, which makes it well suited to loose, search-box style queries:

Search for words containing "book":

filter: 'keyword == "book"'
// Finds: "book", "books", "booking", etc.

Search by inflected form - finds the base word:

filter: 'keyword == "apples"'
// Finds: "apple" (because "apples" is its plural form)

filter: 'keyword == "running"'
// Finds: "run" (because "running" is its present participle)

Partial match:

filter: 'keyword == "swim"'
// Finds: "swim" (the lemma matches, as do forms such as "swimming")
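The keyword behavior described above can be modeled in memory. The following is a minimal Python sketch, not the service's implementation: `Word` and `matches_keyword` are hypothetical names, and it assumes that "partial match" means a case-insensitive substring test against the lemma and each inflected form.

```python
from dataclasses import dataclass, field


@dataclass
class Word:
    """Hypothetical in-memory stand-in for a stored word."""
    lemma: str
    forms: list[str] = field(default_factory=list)


def matches_keyword(word: Word, keyword: str) -> bool:
    """Assumed semantics: case-insensitive substring match on lemma or any form."""
    needle = keyword.lower()
    candidates = [word.lemma, *word.forms]
    return any(needle in candidate.lower() for candidate in candidates)


words = [
    Word("apple", ["apples"]),
    Word("run", ["runs", "running", "ran"]),
    Word("book", ["books", "booking", "booked"]),
]

# Searching by an inflected form surfaces the base word.
hits = [w.lemma for w in words if matches_keyword(w, "Running")]
print(hits)
```

Under these assumptions the search for "Running" selects only the word whose lemma is "run", mirroring the inflected-form example above.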

Category Filtering

Find CET-4 level words:

filter: 'category in ["cet4"]'

Find words that are in either CET-4 OR CET-6:

filter: 'category in ["cet4", "cet6"]'

Surface Term Lookup (Batch Lookup)

The surface field enables exact match batch lookup of words by their lemma or inflected forms.

Key differences from keyword:

  • keyword: Partial match (contains), searches in both lemma and forms
  • surface: Exact match on the whole term (case-insensitive), used for batch lookup of specific terms

The surface field is particularly useful for:

  • Looking up multiple specific words in one query
  • Finding words by exact inflected forms (e.g., "running" → "run")
  • Batch operations with exact term matching

Find word by exact lemma:

filter: 'surface in ["run"]'

Find word by exact inflected form:

filter: 'surface in ["running"]'
// Returns: "run" (exact match on the form "running")

Batch lookup multiple words:

filter: 'surface in ["run", "swim", "walk"]'
// Returns all three words

Batch lookup by inflected forms:

filter: 'surface in ["running", "swam", "walked"]'
// Returns: run, swim, walk (by their inflected forms)

Combined Filtering

English words in technology category:

filter: 'language == "en" && category in ["technology"]'

CET-4 words containing "comp":

filter: 'keyword == "comp" && category in ["cet4"]'

English words matching specific surface forms:

filter: 'language == "en" && surface in ["running", "swimming"]'
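Filter strings like the ones above are easy to get wrong when assembled by hand, mostly because of quoting. The following Python sketch shows one way a client might build them. `build_filter` and `quote` are hypothetical helpers, not part of the API, and the sketch assumes that JSON-style string escaping is acceptable to the server's CEL parser.

```python
import json


def quote(value: str) -> str:
    """Render a string as a double-quoted literal (assumes JSON-style escapes are valid)."""
    return json.dumps(value, ensure_ascii=False)


def build_filter(language=None, keyword=None, category=None, surface=None) -> str:
    """Hypothetical helper: join the supplied conditions with &&."""
    clauses = []
    if language is not None:
        clauses.append(f"language == {quote(language)}")
    if keyword is not None:
        clauses.append(f"keyword == {quote(keyword)}")
    if category:
        clauses.append(f"category in [{', '.join(quote(c) for c in category)}]")
    if surface:
        clauses.append(f"surface in [{', '.join(quote(s) for s in surface)}]")
    return " && ".join(clauses)


print(build_filter(language="en", surface=["running", "swimming"]))
```

Calling it with `language="en"` and two surface terms reproduces the last filter shown above. An empty call returns an empty string, which a client would presumably treat as "no filter".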

Sorting

The order_by field supports:

| Field | Description |
| --- | --- |
| lemma | Sort by lemma alphabetically |
| created_at | Sort by creation time |
| updated_at | Sort by last update time |

Add desc for descending order:

order_by: "lemma"            # Ascending (A-Z)
order_by: "lemma desc"       # Descending (Z-A)
order_by: "created_at desc"  # Newest first
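A client or server may want to validate the order_by string before using it. The sketch below is one possible parser in Python. `parse_order_by` is a hypothetical name, and rejecting any direction other than `desc` is an assumption: the documentation above only describes the bare field name and the `desc` suffix.

```python
ALLOWED_SORT_FIELDS = {"lemma", "created_at", "updated_at"}


def parse_order_by(spec: str) -> tuple[str, bool]:
    """Return (field, descending) for a spec such as "lemma desc"."""
    parts = spec.split()
    if not parts or len(parts) > 2:
        raise ValueError(f"invalid order_by: {spec!r}")
    field_name = parts[0]
    if field_name not in ALLOWED_SORT_FIELDS:
        raise ValueError(f"unsupported sort field: {field_name!r}")
    descending = False
    if len(parts) == 2:
        # Assumption: "desc" is the only documented direction keyword.
        if parts[1] != "desc":
            raise ValueError(f"unsupported direction: {parts[1]!r}")
        descending = True
    return field_name, descending


print(parse_order_by("created_at desc"))
```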

Pagination

Standard pagination with page number and page size:

pagination: {
  page_no: 1,
  page_size: 20
}

  • page_no: Page number (1-indexed)
  • page_size: Number of items per page (max: 100)
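Because page_no is 1-indexed, translating a page request into a row offset takes a small subtraction. The Python sketch below illustrates the arithmetic. `to_offset_limit` is a hypothetical helper, and clamping an oversized page_size to 100 is an assumption; the service may instead reject such requests.

```python
MAX_PAGE_SIZE = 100  # documented maximum


def to_offset_limit(page_no: int, page_size: int) -> tuple[int, int]:
    """Convert a 1-indexed page request into (offset, limit)."""
    if page_no < 1:
        raise ValueError("page_no is 1-indexed and must be >= 1")
    if page_size < 1:
        raise ValueError("page_size must be >= 1")
    # Assumption: oversized requests are clamped rather than rejected.
    limit = min(page_size, MAX_PAGE_SIZE)
    offset = (page_no - 1) * limit
    return offset, limit


print(to_offset_limit(3, 20))
```

With page_no 3 and page_size 20, the first two pages (40 items) are skipped and the next 20 are returned.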

Complete Examples

Example 1: English CET-4 words, sorted alphabetically

{
  "filter": "language == \"en\" && category in [\"cet4\"]",
  "order_by": "lemma",
  "pagination": {
    "page_no": 1,
    "page_size": 50
  }
}

Example 2: Batch lookup with sorting

{
  "filter": "surface in [\"running\", \"swimming\", \"walking\"]",
  "order_by": "lemma",
  "pagination": {
    "page_no": 1,
    "page_size": 10
  }
}

Example 3: Technology words updated recently

{
  "filter": "language == \"en\" && category in [\"technology\"]",
  "order_by": "updated_at desc",
  "pagination": {
    "page_no": 1,
    "page_size": 20
  }
}
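The backslash-escaped quotes in the examples above are a result of embedding a CEL expression inside a JSON string. A client that serializes the request from a native structure does not need to write them by hand. The Python sketch below shows this for Example 1. It only builds the body; the endpoint and transport are not specified here, so none are shown.

```python
import json

# The filter is written as plain CEL; json.dumps adds the JSON escaping.
request_body = {
    "filter": 'language == "en" && category in ["cet4"]',
    "order_by": "lemma",
    "pagination": {"page_no": 1, "page_size": 50},
}

encoded = json.dumps(request_body, indent=2)
print(encoded)
```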

Implementation Details

Surface Term Lookup

The surface field is resolved with a query that:

  1. Matches the word's lemma directly (case-insensitive)
  2. Joins with the lexemes and lexeme_forms tables to find words with matching inflected forms
  3. Uses OR logic to return words that match ANY of the provided surface terms

SQL logic (simplified):

SELECT * FROM words
WHERE
  LOWER(lemma) IN ('running', 'swimming') OR
  EXISTS (
    SELECT 1 FROM lexemes l
    JOIN lexeme_forms f ON l.id = f.lexeme_id
    WHERE l.word_id = words.id
      AND LOWER(f.text) IN ('running', 'swimming')
  );

This enables efficient batch lookup while maintaining the ability to find words by any of their forms.
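To experiment with the simplified query, it can be run against a throwaway database. The Python sketch below uses an in-memory SQLite database in place of the production store so that it is self-contained. The table and column names follow the simplified SQL above; the sample rows, the `lookup_surface` helper, and the lowercasing of input terms on the client side are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE words (id INTEGER PRIMARY KEY, lemma TEXT);
    CREATE TABLE lexemes (id INTEGER PRIMARY KEY, word_id INTEGER);
    CREATE TABLE lexeme_forms (lexeme_id INTEGER, text TEXT);
    INSERT INTO words VALUES (1, 'run'), (2, 'swim'), (3, 'walk');
    INSERT INTO lexemes VALUES (10, 1), (20, 2), (30, 3);
    INSERT INTO lexeme_forms VALUES
        (10, 'running'), (10, 'ran'),
        (20, 'swimming'), (20, 'swam'),
        (30, 'walked');
""")


def lookup_surface(terms):
    """Return lemmas whose lemma or any inflected form equals one of the terms."""
    if not terms:
        return []
    lowered = [t.lower() for t in terms]
    marks = ", ".join("?" for _ in lowered)
    sql = f"""
        SELECT lemma FROM words
        WHERE LOWER(lemma) IN ({marks})
           OR EXISTS (
               SELECT 1 FROM lexemes l
               JOIN lexeme_forms f ON l.id = f.lexeme_id
               WHERE l.word_id = words.id
                 AND LOWER(f.text) IN ({marks})
           )
        ORDER BY lemma
    """
    # The term list is bound twice: once for the lemma test, once for the forms test.
    return [row[0] for row in conn.execute(sql, lowered + lowered)]


print(lookup_surface(["Running", "swam"]))
```

Terms are passed as bound parameters rather than interpolated into the SQL text, which keeps the sketch safe against injection regardless of what the caller supplies.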

Category Filtering Logic

Category filtering uses OR logic:

  • category in ["cet4", "cet6"] returns words that have EITHER "cet4" OR "cet6" in their categories array
  • A word with categories: ["cet4", "technology"] will match the above filter
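The OR logic above amounts to a set-overlap test. The Python sketch below models it in memory. `matches_category` is a hypothetical name, and this is only a model of the semantics, not the database query the service runs.

```python
def matches_category(word_categories, requested):
    """OR logic: true when the word has at least one of the requested categories."""
    return not set(word_categories).isdisjoint(requested)


# A word tagged ["cet4", "technology"] matches category in ["cet4", "cet6"].
print(matches_category(["cet4", "technology"], ["cet4", "cet6"]))
```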

Performance Considerations

  • Keyword searches use case-insensitive partial matching
  • Surface term lookups use indexed queries for efficiency
  • Category filtering leverages JSONB array operators in PostgreSQL
  • Pagination is applied after filtering and sorting to minimize data transfer