Synchronization with Language Depot

Maria_Stolen · November 17, 2020, 9:25am

Can anyone explain how Language Depot synchronization works with Language Forge and FLEx? (I know where all the buttons are; I’m wondering what actually happens when I push the buttons.) Specifically: is it backing up/cloning/replacing items? What happens if LF and FLEx have both been changed and then you try to sync? Is it writing one version over another?

What happens to custom fields? If they are created in LF? In FLEx?

What happens when data disappears after a sync? Is it possible to dig it of the diff logs on LD?

Or maybe there’s an article already that explains all this?

Juha_Metsakallas · November 18, 2020, 1:26pm

AFAIK Language Depot is a Mercurial repository, i.e. it keeps all the files and does extensive bookkeeping of all changes.

To simplify:

Let’s say you pull (receive in SIL terminology) a record X with a version mark 100. Edit something and push (send) it back. X gets a version mark 101.

Now if user A pulls X with X at 101 and user B at 102. B finishes edits first and pushes X back, X gets stamp 103. When A finishes and tries to push their changes, a conflict arises because the system knows that they pulled at 101 but the current stored is 103.

I’m not sure how LD handles this, but I think they give you a simple option to choose between overriding (B’s edits are lost) or falling back (A’s edits are lost). Therefore it is imperative to S/R often.

I don’t know what is the locking level in LD, i.e. what is the smallest edit unit that is pulled/pushed and gets lost in case of a conflict.

HTH

chris_hirt · November 23, 2020, 10:06am

Hi Maria,

In addition to Juha’s helpful reply, here’s some more details:

FLEx <–> FLEx and FLEx <–> LF Send/Receive have the same general conflict resolution strategy, which can be described as “They Win”. This is context-dependent however. e.g. if one edits and the other deletes, the edit always survives regardless of who “they” is.
FLEx <–> LF send/receive is only partially implemented. Input systems, list items, and custom fields currently only sync one-way from FLEx into LF, and then only initially.
Pushing Send/Receive in LF does create a “commit” in the repo, and this can be seen using Language Depot and other Mercurial tools which display the project commit history.
Once FLEx <–> LF sync is working (and it appears to only work for some projects), only the data that has changed in FLEx/LF gets updated. It does not do a full replace-all-items-every-time operation, like I think you are suggesting.
Figuring out what has changed in FLEx can be challenging. Reverting back to a previous version of the repo is also challenging. LF has a partially implemented “activity” feature which could have been helpful, but currently it only works on the entry level. Having a project-wide activity feed where every change is tracked and displayed is something we wanted to do, but ran out of resources to implement.

Thank you for your questions.

Maria_Stolen · November 25, 2020, 10:33am

Thanks, that sort of makes sense! Especially the bit that edits trump deletions is useful to know.

Not realizing how things worked, we were entering plurals as a custom field. Turns out syncing deleted all of those at some point.

Unfortunately I managed to completely break synchronization between LF and LD, so may end up shifting my team over to WeSay if I can figure that out.

Robin_Munn · December 18, 2020, 7:23am

To reply to Juha’s statement about the locking level in LD (the smallest edit unit that is pulled/pushed and gets lost in case of a conflict), when it comes to Language Depot and FLEx data, the answer is the same. For the linguistic data that Language Depot and FLEx use, the smallest edit unit that can conflict is a single field in a single entry (and even smaller than that in certain circumstances). In Juha’s example where user A pulled a dictionary entry X at version 101, and B pulled entry X at version 102, and then B edited and pushed (creating version 103) before A pushed his own edits, here’s how that would go:

Scenario 1: A and B edited different fields of the entry
If B edited the Grammatical Notes field, and A edited the Definition field, there would be no conflict. When A pushed, the Send/Receive system would notice that his version of entry X was at version 101, and the most recent version was 103, including an edit by B to the Grammatical Notes field. Since both edits were in different fields, there’s no chance of the edits conflicting. So Language Depot would create version 104 as a merge version, that merges versions 101 and 103. Version 104 would have both A’s Definition edits and B’s Grammatical Notes edits included.

Scenario 2: A and B edited the same field, but different parts of the field
Most fields in FLEx and Language Forge are multi-part fields. The most common kind of multi-part field is a field where you can have different text for each writing system in your project. E.g., in the Lexeme field you’ll often have two different writing systems, one for IPA and one for the actual alphabet of the language. If A and B both edited the Lexeme field, but B edited the IPA part of the field while A edited the local-alphabet part, then again there would be no conflict. The Send/Receive system would see that A had seen version 101 where the local-alphabet lexeme field contained “cat”, and since version 103 still has “cat” in the local-alphabet lexeme, it would allow A’s edit (where he changed “cat” to “kitty”) to be merged. Meanwhile, B’s edit changing the IPA field from “kʰæt” to “kʰɪɾi” would be preserved, so version 104 would be created with the Lexeme field containing “kʰɪɾi” in the IPA writing system, and “kitty” in the the local-alphabet writing system. Again, no conflict.

Scenario 3: A and B edited the same part of the same field, in exactly the same way
If A and B had both edited the Lexeme field, and made the exact same edit, again there would be no conflict. (I think; this particular scenario is one that I haven’t seen happen in practice, so I’m not 100% sure if I’m right about how the Send/Receive system would handle it). Here, the Send/Receive system would use the version history graph to figure out that there was no conflict. FieldWorks and Language Forge don’t show you the version history graph, but if you use advanced tools for working with Mercurial repositories (most of which are designed for software developers), it’s possible to get it to show you a graph of which version was based on which other version, which version was a merge between two previous versions, and so on. Usually the graph will look like a straight line, more or less, but in situations like the one Juha mentioned it’s possible you’ll see a place where there’s a “fork” in the graph, which hopefully gets resolved again soon. In this case, the “fork” would happen at version 101. On one side of the fork would be version 102 (based on 101), then version 103 that B created, and on the other side of the fork would be version 101, and nothing after that until A pushes his change and the two sides of the fork get resolved. (If you’re a more visual kind of person, there’s a diagram below that shows more or less what that would look like, though since I copied this graph from someone else’s example scenario, the ID numbers are different from the ones in Juha’s example).

Now, as I said, there’s a version history graph, and the Send/Receive system can trace the “parent” of each version, and when it needs to merge two commits it can find the “ancestor” version of each commit. And it can compare what the data looked like in the “ancestor” commit to what it looked like after A was done editing it, and compare what the data looked like in the “ancestor” commit to what it looked like after B was done editing it. So it would see that at version 101, the Lexeme field contained “cat”. B changed “cat” to “kitty” in version 103, and A is now trying to change “cat” to “kitty”. Since both edits are the same, there’s no conflict, and it will create a version 104 that merges versions 101 and 103, and which has “kitty” in the Lexeme field.

treeGraph

Scenario 4: A and B edited the same part of the same field, in different ways
This is the one that will cause conflicts, and as Chris mentioned, the general resolution strategy when the Send/Receive system finds a conflict is the “They Win” strategy. In other words, when A does a Send/Receive, if his edit conflicts with B’s then B’s edit will win, and A will be notified of the conflict. (If B had done the Send/Receive, then A’s edit would have won). In Language Forge, the conflict notification will show up as a comment on the relevant lexical entry. In FLEx, the conflict notification shows up when you’re done with the Send/Receive process, and it shows you all the conflicts it found. It also creates an annotation on the lexical entry, which shows up as a little icon of an exclamation mark in a triangle (or is it in a circle?); clicking on the icon will show you the conflict data, which should look something like “A edited the field to say “domesticated cat”, while B edited the field to say “kitty”. Send/Receive kept the edit made by B.” So while A’s change has been “lost” in the sense that it’s no longer in the data, it’s been preserved in the notification message. The idea is that A will see the conflict message, and talk to B about which one it should be. Then if they agree that A’s change was correct and it really should be “domesticated cat”, then A will edit the field again, creating another version where “kitty” is changed to “domesticated cat”. (If the change was really long, then the fact that it’s still available in the notification log will be helpful. as A can just copy and paste the text to re-create the change). This time when A sends that commit to Language Depot, it will be based on the most recent version of the data so it will “win” and the final version of the data will have “domesticated cat” in that entry.

Scenario 5: A and B both add something to the same list
This is by far the most common edit conflict, and it’s not a serious one. If there was a list somewhere (such as a list of lexical senses in an entry), and both A and B added a new item to the list, then the Send/Receive system won’t know which order the items should go in, and it will just choose an order arbitrarily, and create an edit-conflict notification about it. For example, if version 101 of the data had just one definition for the word “bank” (“a business that keeps money and pays interest”), maybe A would add the definition “the side of a river”, while B would add the definition “to lean to one side while turning, like an airplane”. Both A and B will then submit a version that has two senses, sense 1 being the “money” sense, and sense 2 being the one they just added (“river” or “turn” respectively). When A submits his change, the Send/Receive system will notice that he only had the "money sense in his data, while the Language Depot data had two senses, “money” and “turn”. The “turn” sense and the “river” sense can’t both be sense 2, so the Send/Receive system will arbitrarily pick one of them (the change made by B, in this case) to be sense 2, and will make A’s change become sense 3. It will then add a conflict notification, so that A and B can talk about the sense order in case it matters, and rearrange it in a later version of the data if they decide that the “river” sense should come before the “turn” sense.

There are other scenarios, such as one where B deleted a lexical entry while A made changes to it, but Chris has already mentioned what happens there (the edits always win over the deletes, so that data isn’t lost; if the entry really should have been deleted, then you can always delete it again after pulling in the latest version of the data, and as long as what you’re deleting is truly the latest version then the delete will happen). As you can see, conflicts are really rare, because even when people have different versions of the data, most of the time they’re editing different lexical entries or different parts of the same entry, and the Send/Receive system is able to preserve both changes with no conflict. And even when a conflict does arise, the most common reason is because two people added data in the same place, and the Send/Receive system can easily preserve both of the things that were added, and it only needs human intervention to verify which order was desired (if it matters). So conflicts where A’s change is really “lost” (it’s never totally lost because it gets preserved in the conflict notification) are really quite rare.

I hope this glimpse “behind the curtain” helps you understand the Send/Receive system a little bit better.

Robin Munn, Language Technology developer

Maria_Stolen · December 30, 2020, 4:22pm

Thanks, Robin. This was a very helpful explanation!