Tasks

Desegma is structured into two main sub-tasks: the first involves the detection of machine-generated texts (MGT) at the document level, while the second involves the segmentation of a document into its human-written and machine-generated parts.

SubTask A: MGT Detection in the Wild

In the first sub-task, we explicitly simulate the challenging conditions encountered in real-world MGT detection. Test documents are:

  1. sampled from semantic domains different from those in the training set;
  2. generated by undisclosed large language models (LLMs);
  3. produced both by vanilla pre-trained models and by LLMs that have been fine-tuned to better mimic the linguistic distribution of human-written texts.

The task is structured as a binary-classification problem and defined as follows: given a piece of text, assign it the label 0 if the text is written by a human, and 1 otherwise. Different mixtures of samples in the test data will represent different levels of complexity: a simple semantic domain shift represents the easier setting, while texts generated by DPO-fine-tuned LLMs represent the higher-complexity samples.

The performance of the proposed solutions will be evaluated via accuracy and F1-score.

Human Text: Viktor Orban, da quando il leghista è diventato ministro dell'Interno, si mandano segnali di apprezzamento reciproco. Un'alleanza che potrebbe portare l'Italia al fianco dei Paesi di Visegrad e che già - in occasione della discussione sulla riforma di Dublino - ha messo in difficoltà gli altri partner dell'Ue. Il primo ad allacciare rapporti era stato Salvini. Da Frosinone il ministro aveva parlato dell'Ungheria come di un paese con cui l'Italia potrà cambiare l'Europa. Entrambi, in fondo, sono dichiaratamente euroscettici. E sia Orban che il leghista ...
Label: 0

Machine Text: Viktor Orban , dopo anni di duri scontri diplomatici, sono pronti a unire le loro forze per riscrivere l'agendia di Bruxelles. Il leader di Fratelli d'Italia Giorgia Meloni e il presidente del governo ungherese si sono visti a Vienna, in un vertice di "centrodestra, identità italiana e sovranità italiana". Dopo un incontro di ben tre ore i due hanno spiegato come si possano conciliare politicamente le due visioni d'Europa
Label: 1

Data description:

Data is now available at this link!

The dataset is a .csv file with 2 columns:

  • text: The texts to classify as machine-generated or human-written.
  • label: The ground truth: 0 means that the text is human-written, 1 that it is machine-generated.
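
To get started, the file can be loaded and inspected with pandas. A minimal sketch, assuming the file is saved locally as subtask_a.csv (the actual file name may differ):

import pandas as pd

# Load the Subtask A data (file name is illustrative)
df = pd.read_csv("subtask_a.csv")

print(df.head())                    # peek at the two columns
print(df["label"].value_counts())   # class balance: 0 = human, 1 = machine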

Here is an example of how to compute Accuracy, the score used to evaluate submissions for subtask A.

from sklearn.metrics import accuracy_score, classification_report
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load your classifier
tokenizer = AutoTokenizer.from_pretrained("my_trained_model")
model = AutoModelForSequenceClassification.from_pretrained("my_trained_model")

# Separate texts and labels (df is the DataFrame loaded as shown above)
texts = df["text"].to_list()
labels = df["label"].to_list()

# Get your predictions (truncation guards against inputs longer than the model's maximum length)
preds = [model(**tokenizer(text, return_tensors="pt", truncation=True)).logits.argmax(dim=-1).item() for text in texts]

# Compute task metric
task_metric = accuracy_score(y_true=labels, y_pred=preds)

# Inspect classification report
print(classification_report(y_true=labels, y_pred=preds))
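
Since F1-score is also used to evaluate this sub-task, it can be computed from the same labels and preds lists built in the example above:

from sklearn.metrics import f1_score

# Binary F1 with the machine-generated class (label 1) as the positive class
f1 = f1_score(y_true=labels, y_pred=preds)
print(f"F1-score: {f1:.4f}")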

SubTask B: Human-Machine Text Segmentation

In the second sub-task, participants are required to detect the boundary between the human-written text and the machine-generated continuation by identifying the index of the character that marks the beginning of the MGT content. Each data sample will consist of a variable-length human-written prompt, always followed by a variable-length continuation produced by the model. Unlike traditional MGT detection tasks that require document-level binary classification, this sub-task focuses on localization: participants must pinpoint the beginning of the text generated by the LLM.

The task is defined as follows: Given a piece of text, return the index of the first character that is generated by an LLM. To ensure statistically robust evaluation, the length of the human-written substring will vary considerably. This setup simulates real-world scenarios in which MGT may be inserted into otherwise human-written content. The same techniques described for the previous sub-task will be used to generate continuations of varying complexity.
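
As a toy illustration of the task definition (the strings below are made up for this sketch, not actual task data):

# Toy example: the target is the index of the first machine-generated character
human = "Il presidente ha fissato "   # human-written prefix, 25 characters
machine = "al 10 agosto la data."     # machine-generated continuation
text = human + machine

target = len(human)                   # 25
assert text[target:] == machine       # everything from `target` onwards is machine text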

The performance of proposed solutions will be evaluated via Mean Absolute Error (MAE).

Shared Human Text: Il presidente del Tribunale internazionale del diritto del mare (Itlos), Vladimir Golitsyn, ha fissato

Machine Continuation: al 10 agosto la data in cui il tribunale arbitrale di Amburgo esaminerà le informazioni che l'Italia intende raccogliere in India per scagionare i Marò . Nei giorni scorsi su vari quotidiani erano uscite indiscrezioni circa la data di un eventuale incontro tra i due fucilieri di Marina e i loro avvocati e i funzionari del ministero dell'Interno indiano, che dovrebbero rilasciare a loro una sorta di "licenza" temporanea così che i due marinai possano recarsi in India.

Target Character Index: 103

Data description:

Data is now available at this link!

The dataset is a .csv file with 3 columns:

  • human: The initial part of the text, written by a human.
  • llm: The machine-generated continuation. Note that human and llm are meant to be concatenated into a single input for the model.
  • human_len: The index of the first character generated by the language model.
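
Putting the three columns together, model inputs and gold boundaries can be derived as follows. A minimal sketch, assuming the file is saved locally as subtask_b.csv (the file name is illustrative):

import pandas as pd

df = pd.read_csv("subtask_b.csv")            # file name is illustrative

texts = (df["human"] + df["llm"]).to_list()  # concatenate into a single model input
true_lens = df["human_len"].to_list()        # gold boundary indices (in characters)

# Sanity check: human_len should equal the character length of the human part
assert all(l == len(h) for h, l in zip(df["human"], true_lens))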

Expected Output

For each input text, systems must return a single integer:

  • the predicted index (in characters) at which the human part ends and the machine part begins;
  • this index should correspond to the character length of the human-authored portion (i.e., human_len).

For the sake of clarity, we also provide here the evaluation script used to evaluate the baseline model.

from sklearn.metrics import mean_absolute_error


def get_baseline_predictions(input_ids, logits, tokenizer):
    """
    This is the function that takes care of converting the baseline model
    logits to the predicted switch-indices.

    Convert model logits into predicted boundary positions (in characters).
    
    Args:
        input_ids (torch.Tensor): Tokenized input sequences (batch_size x seq_len).
        logits (torch.Tensor): Model output logits (batch_size x seq_len x num_labels).
                              Each position typically has scores for classes like {0 = human, 1 = switch}.
        tokenizer (PreTrainedTokenizer): Hugging Face tokenizer used to decode tokens.
    
    Returns:
        List[int]: Predicted human segment lengths in characters for each sequence.
    """
    # Take the class with highest probability at each token
    preds = logits.argmax(dim=-1)

    sequence_lengths = []
    for i, sequence in enumerate(input_ids):
        # Find the tokens predicted as "machine" (candidate switch points)
        switch_positions = preds[i].nonzero(as_tuple=False)

        if switch_positions.numel() > 0:
            # Take the first predicted switch
            idx = switch_positions[0].item()
        else:
            # Fallback: if no switch is predicted, assume end of sequence
            idx = len(sequence)

        # Decode the sequence up to that point, then measure its character length
        sequence_lengths.append(
            len(tokenizer.decode(sequence[:idx], skip_special_tokens=True))
        )
    
    return sequence_lengths


def compute_metric(true_lens, pred_lens):
    """
    Compute the evaluation metric (Mean Absolute Error).
    
    Args:
        true_lens (List[int]): Gold standard human segment lengths (in characters).
        pred_lens (List[int]): Predicted human segment lengths (in characters).
    
    Returns:
        float: Mean Absolute Error between predictions and gold labels.
    """
    mae = mean_absolute_error(y_true=true_lens, y_pred=pred_lens)
    return mae


# Testing the metric...
# Placeholders: substitute your own tokenizer and trained token-classification model
tokenizer = SomeTokenizer
baseline = SomeModel

human = "This is the human part and"
llm = " this is the LLM part."

# combined = human + llm: "This is the human part and this is the LLM part."
text = human + llm

inputs = tokenizer(text, return_tensors="pt")
input_ids = inputs.input_ids    # shape (1, seq_len)
output = baseline(input_ids)
logits = output.logits          # token-level scores, shape (1, seq_len, num_labels)

# pred_lens is a list of predicted lengths in characters, e.g. [4]
pred_lens = get_baseline_predictions(input_ids, logits, tokenizer)

# true_lens is the list of true lengths in characters (the `human_len` column in the dataset)
true_lens = [len(human)]    # [26] for the toy strings above

mae = compute_metric(true_lens, pred_lens)
print(f"True len: {true_lens}, Pred len: {pred_lens}, MAE: {mae}")
# e.g. True len: [26], Pred len: [4], MAE: 22.0