Need tips for lexical models

I need tips on how to maintain an up-to-date wordlist and frequency table as new sources become available.
It’s tedious generating a wordlist from raw text all over again and going through the list to make corrections, etc.
Any helpful tips is appreciated please.

Hello @katelem,

Please refer to this link and check if PrimerPrep tool could help you with this.

Thank you.

I’ve been using PrimerPrep. My problem is:

  • This list contains over 30,000 items (with frequencies. I took days to manually edit the list, identify the proper nouns and change them to initial caps, since PrimerPrep outputs only lower cases.
  • There are now new sources (textbooks, handbooks, etc) which have not been crawled.
  • What should be the next step since I can’t sit for days again going through the word list to find proper nouns and change case.
  • Is there any way to simply add the new generated wordlist to the existing one without having many duplicates, etc.
1 Like

The compiler will accept multiple source files and add the counts together for duplicate words. I’d suggest you take your new sources and process them independently of the existing wordlist, then specify the resulting file as a second wordlist source for the lexical model.

If you do want to combine two wordlists, one way to do that is to use the Pivot Table spreadsheet function. Let me know if you would like more details. (And someone else might have an easier suggestion.)

I understand your concern with proper nouns. Suppose you take a copy of the original text and use an editor (with support for regular expressions) to change \s+[a-z]+ to nothing (empty string) through the whole file. (You would need to change what is in the square brackets to include other letters in your language.) This would leave capitalized words. Many of those might be merely the first word in the sentence, so you’d need to eliminate them as well, leaving a list of proper nouns. You could then use that list to correct the capitalization for those words in the output from Primer Prep. Would that be of any help?

2 Likes

It seems to me that what you want to do is to find the difference the original and new wordlists, ignoring case and word frequencies, to find which are the new words in the new wordlist that need to be verified? Is that right?

2 Likes

Yes, I need more information on the pivot table method. I once tried it but it didn’t give me the desired result. Maybe there’s something I wasn’t doing right. Please give me more information on this.
I will also like to try the regular expression one.

Yes Sir. That will be very helpful. I will like to know which words are actually new, so I can edit only those new words and add them to the existing wordlist.

The screenshots here are from LibreOffice Calc. (I think Microsoft Excel and google sheets have similar functions.)

  • Open a new, empty file

  • In cell A1 enter the heading Word

  • In cell B1 enter the heading Count

  • From your first source file, copy the first two columns (containing word and count)

  • Paste the data into the new file starting at cell A2. In this simple example, there are just three words from the first source file.

  • From your second source file, copy the first two columns (containing word and count)

  • Paste the data into the new file starting just below the data you pasted in a previous step. In this simple example, there are just three words from the second source file.

  • Select the entire region with the headings and the words and counts that you have pasted in. (In LibreOffice, click in cell A1, hold down the Shift key and type right-arrow, then Ctrl+down-arrow.)

  • From the “Data” menu, select “Pivot Table”, then “Insert or Edit”

  • In the Select Source dialog, the “Current selection” option should be selected (but if not, select that option). Then select OK.

  • In the Pivot Table Layout dialog, you will drag the two items (“Word” and “Count”) from the rightmost pane. First drag “Word” to the “Row Fields” box. You should see “Word” appear in the “Row Fields” box.

  • Now drag “Count” to the “Data Fields” box. You should see “Sum - Count” appear in the “Row Fields” box.

  • Click OK. A new sheet named “Pivot Table_Sheet_1” appears with the combined counts. In this very simple example, you can see that it combined counts for the two items that were in both source files (“abc” and “ghi”) and brought the other items as well.

  • Ignore the heading line at the top and the total line at the bottom and copy the data from the two columns to the desired location. (You can then delete the temporary working file.)

2 Likes

With Microsoft Excel 2019, follow the LibreOffice Calc steps until the “Data” menu one, then use the following steps:

  • From the “Insert” menu, select “Pivot Table”.
  • In the “Create Pivot Table” dialog, the “Select a table or range” should be selected and the range you selected previously should be listed. The “New Worksheet” should be selected. Click “OK”.
  • Under “Choose fields to add”, tick the box next to “Word” and then the box next to “Count”. You should see a table similar to the one in the above instructions.

With google sheets, follow the LibreOffice Calc steps until the “Data” menu one, then use the following steps:

  • From the “Insert” menu, select “Pivot Table”.
  • In the “Create Pivot Table” dialog, the range you selected should be listed and “New Sheet” selected. Click “Create”.
  • In the Pivot Table Editor on the right, under the “Suggested” heading, select “Sum of Count for each Word”. You should see a table similar to the one in the original instructions.
2 Likes

Thank you very much @drowe. This is very helpful. I just tested it. I find that in two new wordlists that sum up to 5286 rows, only 612 rows were new (i.e. different from those in the already existing wordlist).

1 Like

I’ve gone ahead and opened an issue with the PrimerPrep team to ask if they might consider turning off the automatic lowercasing behavior:

2 Likes

It’s good you requested. But it may be a complex issue because all words at start of sentence will be capitalised even though they are not proper nouns.
One way to solve this will be to find a way to instruct: “if a word appears in all caps, and in initial caps and lower case, convert to lower case.” I don’t know how easy that will be.

Glad that @drowe’s solution worked for you; let’s go with that for now!

I still need to know how to do this. It will be helpful.

This isn’t very elegant, but might do what you want.
If you repeat the spreadsheet exercise, but this time paste in the words from the new list twice. Then replace the count value with 1 for all three lists (the existing list and both copies of the new list).

Then repeat the Pivot Table part. Now the Sum of the Count field will be either 1 (existing list only), 2 (new list only) or 3 (both lists).

Copy that data to a new sheet, then sort by the second column as the first sort field, and the first column as the second sort field, then take the list of words with a “2” in the second column. This should be an alphabetical list of the new words.


In this simple example, “jkl” was the only new “word”, but had there been more they would show up with a “2” in the second column.

No worries, @mengheng will assist you with this :grinning_face:

Hello @katelem,

Please install the project and follow the instructions. Let us know if you have the result you are looking for. Thanks!


Instruction

Wordlist Prerequisites

  1. In a wordlist excel file, must have two sheets named: “wordlist” and “new_wordlist”:
    • The “wordlist” sheet is for the exisiting wordlist from Keyman Lexical Model.
    • The “new_wordlist” sheet is for the new words to be added.
  2. If there are other sheets in the file, they can be kept as long as their sheetnames aren’t:
    • “_wordlist”
    • “_new-words-only”
    • “_new-changed-only”
  3. As of now, both wordlist sheets require to have 2 columns (word and count) with data.
    • If not, the code won’t execute properly.
  4. After executing the code, you will see 5 sheets in total in an excel wordlist file.
    • “wordlist”
    • “new_wordlist”
    • “_wordlist”
    • “_new-words-only”
    • “_new-changed-only”
  5. Please see the example wordlist.xlsx file to understand better.
  6. Feel free to use the result and modify the words accordingly.

Software Prerequisites

  1. Install VS Studio Code link
  2. Install Python v3.13 or above link
  3. Install Pip v25.1.1 or above can follow along
  4. Open VS Studio Code → Open Terminal → New Terminal
  5. Make sure to be in the path: path/path/wordlist_comparison (the project path)
  6. Run pip install -r requirement.txt
    • If the installation is incomplete or cause an error, please address them or reinstall using this: pip install pandas openpyxl

Execute Project

For Windows

  • Open VS Studio Code → Open Terminal → New Terminal
  • Make sure to be in the path: path/path/wordlist_comparison (the project path)
  • Run py main.py
  • The Output should be:
    Deleting existing _wordlist and _new-changed-only sheetname...
    Successfully created a sheet for _wordlist and _new-changed-only.
    Deleting existing _new-words-only sheetname...
    Successfully created a sheet for New Words only.
    Data construction completed.
  • You can open the wordlist.xlsx file to see the result.

IMPORTANT
The excel file must be closed to execute the code.

NOTE
It may take longer if the wordlist is large.