Show Notes: Finding great sources of information is part of what makes genealogy so fun! Citing those sources may not be quite as much fun. In this episode, professional genealogist Gail Schaefer Blankenau makes the case for why source citation is a vital part of great genealogy research, and she’s going to give us the resources to help get the job done right.
Listen to the Podcast
- Complete show notes
- Watch the video version of the interview
- Downloadable ad-free show notes
(Premium subscription required. Become a Genealogy Gems Premium Member today.)
Show Notes: I’m trying out the MyHeritage AI Time Machine photo tool! Check out the results and get tips for getting the best results.
Video Premiere with Live Chat
How it works
AI Time Machine™ utilizes text-to-image technology licensed from a company called Astria. Using a variety of photos of one person that you upload to the website, it builds a model showing that person in a variety of poses and lighting conditions that are different from those in the original photos. Then, using a series of predefined themes, it synthesizes the model with motifs from a large variety of historical themes to create the photorealistic images.
Cost: As of this writing, Complete subscribers will be able to create 3 complimentary models per year and receive all available themes for each model (subject to change). Keep an eye out for free promotions throughout the year.
Photos you’ll need for the best results
- Do one person at a time.
- Crop group photos down to one individual.
- The more photos you use, the better the results. Take extra photos if you don’t have enough already on your phone.
- Don’t use an assortment of photos at different ages.
- Use an assortment of poses: 3 full body shots, 5 medium (waist up) shots, and 10 close-ups.
- Use photos with an assortment of poses and expressions, with your eyes looking different directions.
- Use photos taken on different days and with different backgrounds.
- Avoiding makeup is recommended.
- Log into your free MyHeritage account. (Don’t have one? Use our link to go to the site and sign up for a free account.)
- In the menu go to Photos > AI Time Machine
- Click the Try it Now button
- Click the Select Photos button or drag and drop your assortment of photos from your computer onto the box on the screen. You can select all of them by clicking the first image, holding down your Shift key (Win) and clicking the last image in the collection.
Uploading can take several seconds or even a few minutes. Leave your browser tab open until you see the screen telling you they will email you when your photos are ready.
- When you receive the notification email, click it to go to your photos on MyHeritage.
- There will be many boxes of photos, each representing a theme. You can scroll through them to quickly find ones that look good. It’s normal to see many that didn’t work out. This happens when the particular photos you uploaded don’t suit a theme very well. However, if you follow the guidelines above, you should have many excellent photos to choose from.
- Click the desired photos and download them to your computer.
The Results are Amazing
Here I am as an Egyptian Queen:
And here I am ready to head off into the 1950s skies as a stewardess:
And I could be the third sister in my grandmother’s family in the 1930s:
Premium Elevenses with Lisa: A candid look at some of the ways technology has already wreaked havoc on genealogy and how we can apply those learnings going into 2023 to ensure we enjoy the benefits of tech while side-stepping the pitfalls.
Watch the Show
Premium exclusive benefit: Download the show notes PDF.
Over the last few years, we’ve been seeing tech causing some havoc generally speaking, and in the genealogy world specifically. Most recently this has happened with the 1950 census where “artificial intelligence” was used to index the records. This exercise has revealed some eye-opening truths about the limitations and problems with technology.
I’m sharing my personal thoughts and opinions because I think this is such an important topic. That’s not to say I have all the answers, because I don’t. However, I have been devoted to the genealogy industry for 16 years and I know and have had the opportunity to talk to folks in leadership roles at all the major genealogy companies. I’m not pointing fingers or criticizing. I’m asking questions, and that’s a healthy thing to do. As genealogists and consumers, we should be interested in and aware of the impact that technology is having on the records we seek.
You may recall my video from Nov. 2020 called The Impact of Artificial Intelligence on Genealogy. In it I posed some questions and brought to light some of my concerns about the direction tech was taking.
It’s our nature to hope for the best and look at the bright side of things. It’s exciting when tech can accomplish things that couldn’t have been possible even a decade ago.
An example of the positive potential of Artificial Intelligence (AI) is the Newspaper Navigator tool at the Library of Congress. Watch my video called How to Find Photos and Images in Old Newspapers with Newspaper Navigator (published Sept. 2020)
Those positive experiences can give you a false sense that tech is always an improvement, particularly when we talk about it in conjunction with genealogy. So, while it’s easy to focus on the bright side, we need to look at all sides.
I recently received a comment on that Impact of Artificial Intelligence on Genealogy video that illustrates the optimism and belief that many folks have in what AI can do:
“I have long wanted someone to apply AI to the indexing of ship records; While it might be too much to expect AI to “recognize” handwriting, AI should at least be able to correct errors that human indexers introduced into the electronic files of those records.
I’m particularly thinking about ship records of Polish immigrants. There are really only about 200 common Polish GIVEN names …with oddities like Wawrzyniec, Kunegunda, Szczepan, Wojciech, etc. that seem to get mis-indexed 90% of the time. Clearly these could get fixed with ease.
(Furthermore, these given names sometimes get mistaken for surnames, so AI could also determine when to switch the two fields). Beyond that, there is a book of 341,055 Polish surnames in a book by William F. Hoffman with correct spellings of surnames.
So, I encourage AI researchers to take on this application: There’s an abundance of data in Ancestry or FamilySearch, and even if AI were to produce suboptimal results, the output would still be an improvement over the current status of the indexed records.”
Wow, that’s a tall order!
Evidence that Tech can Wreak Havoc
But wait a second: Have you tried calling the cable company or the phone company lately?! They are a nightmare example of how technology is being used to replace people and provide “customer service”. Has that really improved customer service?
I regularly get emails touting how AI could improve Genealogy Gems: “Our AI Tool can write high quality content for you (emails, blogs, website copy, ad copy etc.) in seconds is for a limited time 40% off.” Can you imagine the havoc it would wreak if I implemented that? No thank you!
The Importance of Asking the Right Questions
We can’t just ask the question “how can technology help me?” We must also ask “what is the price of implementing this? What problems might it cause?” If we don’t ask these questions, how can we really understand and trust what we’re seeing, particularly in online research?
My hope is that this review of what happened with the 1950 census indexing will help you enjoy the benefits of tech while side-stepping the pitfalls. The following information is based on my own research and candid interviews with genealogy company insiders directly involved with leading the effort.
The 1950 Census Publishing Timeline:
- Prior to the release of the 1950 census, we heard that “AI” would be a game changer when it comes to indexing it.
- April 1, 2022: the 1950 census was released by NARA
- NARA also immediately released its own basic text-recognition generated index
- Ancestry did a separate more robust “AI” index that they released mid to late April, and they gave it to FamilySearch for review by human volunteers.
- FamilySearch predicted the index would be completed by 6/14/22
- Summer 2022: Crickets. At one point in late summer 2022 it was computed that it was going to take them almost nine years to finish if they stayed on the current course with volunteer reviewers.
- Around September 2022 the plug was pulled on the volunteer program. Ancestry, FamilySearch and MyHeritage went their separate ways, sending their data out to vendors to complete, while doing some of the cleanup work in-house. At this point all the companies were on their own. This means each company has a different version.
- 11/19/22 MyHeritage announced their full index was now available.
- November 2022 FamilySearch says: “Yes, so it’s all finished. Everything right now is published on Ancestry and on FamilySearch.”
I was surprised at what I heard in the interviews I conducted. Was technology and specifically “AI” really the boon to the 1950 census indexing effort that it was touted to be?
The Bigger Questions
There are some bigger questions that also need addressing:
- How can technology actually introduce new errors and problems?
- How can our understanding of those issues help us more strategically improve our searches?
The Consortium: Ancestry, FamilySearch and MyHeritage
They initially worked together in a variety of ways:
- Ancestry provided the “AI” Technology
- FamilySearch provided the volunteer review of AI-generated data using their Curation Tool. “192,000 people help us review the census.”
- MyHeritage participated financially
Ancestry took the output from their AI process and, as they were giving it to FamilySearch for curation, started to build the full set so they could release their AI index as a full index, one they hoped would be much better than NARA’s in terms of the number of fields. It’s certainly more robust.
All of the territories were given to vendors from the beginning. In the case of Guam, or Panama Canal Zone, they were given to these vendors that the industry uses, based mostly in India, in the Philippines or in China.
Be aware that the number of fields available varies between the genealogy companies.
Ancestry indexed all of the supplemental questions. FamilySearch has the data as well. MyHeritage does not.
Debate about AI
There’s an interesting debate in the genealogy community about whether or not “Artificial Intelligence” or “Machine Learning” was actually used in the indexing. One insider told me “Why they call it AI is beyond me. It’s nothing to do with artificial intelligence. It’s just plain old handwriting recognition.” He said “AI” is the popular buzzword, and all the genealogy companies are guilty of using it regardless of how technically accurate it is.
FamilySearch told me “It was machine learning, and it improved it.” Others said AI helped create the handwriting recognition tool, but it wasn’t technically AI that was used directly on the indexing.
No matter what level of tech, or what you call it, it is technology. And the future of the genealogy industry is definitely focused on AI.
Overall, this did not go well
Indexing had a tech side and a human side. We’re discussing the tech side here. In reality, the human side also relied heavily on tech (for example, FamilySearch’s Curation Tool).
The volunteer community didn’t fully embrace the 1950 census indexing for a couple of reasons.
- States were released over time. They had to check back.
- It didn’t keep people invested in the spirit of “let’s just come and get it all done so everybody has access.” People got access to their own relatives’ records for indexing, which could have reduced the incentive for some volunteers to do as much as volunteers did for the 1940 census.
- Volunteers experienced frustration entering “low genealogy value” numeric codes like occupations and industry, and dealing with changing instructions, etc.
- The National Archives index made it possible to search without investing in the FamilySearch index. “I think it took some wind out of the community sails as well.”
AI Had Trouble with All Deviations
There were many areas where the text recognition software introduced a large number of errors, creating more work, not less. Here are a few examples in no particular order.
Gender – The software encountered a lot of variations, struggled to process them cleanly, and did not “machine learn” from them: M, F, Male, Female, U for unknown, blanks, Son, Boy, etc. They got thousands of versions of basically garbage in that field.
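To make the cleanup problem concrete, here is a minimal sketch of what normalizing a messy Gender field could look like. This is only an illustration: the mapping table and the `normalize_gender` function are hypothetical, the "Daughter" and "Girl" entries are my own assumption by analogy with "Son" and "Boy", and none of this reflects the actual rules used by any of the genealogy companies.

```python
import re

# Hypothetical lookup table (an assumption for illustration only).
# "daughter" and "girl" are added by analogy with "son" and "boy".
GENDER_MAP = {
    "m": "M", "male": "M", "son": "M", "boy": "M",
    "f": "F", "female": "F", "daughter": "F", "girl": "F",
    "u": "U", "": "U",
}

def normalize_gender(raw):
    """Map a raw indexed value to M, F, or U (unknown)."""
    # Keep letters only, which also drops stray marks and punctuation
    cleaned = re.sub(r"[^a-z]", "", raw.lower())
    # Anything not in the table is treated as unknown
    return GENDER_MAP.get(cleaned, "U")

for value in ["M", "Female", "Son", "", "M '"]:
    print(repr(value), "->", normalize_gender(value))
```

Even a simple table like this has to be built and maintained by someone, which is part of why the unexpected variations created more work rather than less.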
Normalization – After the data runs through the “AI” and the human review, some of the fields go through what the industry calls Normalization or Standardization. This work is neither done jointly nor shared across the big 3 genealogy companies.
This is where tech blends with people, who blend back with tech, because the reviewers are relying on the curation tool. Since no one was prepared for this, the instructions were woefully inadequate in the beginning. It took a lot of adapting and pivoting to keep up with the curve balls. With bad data coming in from “AI”, the volunteer review side got so complex that, in the opinion of one top source I spoke to, it drove the community away.
Locations/Cities – One example of the problem is the city Las Vegas in Nevada. In 1950 it’s a pretty small city, about 40,000 people or so. But it is well known. After AI and human review, they got probably 300 different ways that Las Vegas was spelled in the data. Lowest Vogels, Least Lugares, etc. “Every variation imaginable.” So, the companies independently standardized those to improve the index. There was also the challenge of enumeration districts and precincts sometimes being entered on the census in place of city. Example: election district, precinct 5, Scott County, Texas.
Ultimately, it comes down to how much time and resources they could stomach to spend on these normalization tasks. At some point, they had to say, it’s good enough.
Normalization tasks were done mostly in-house, but also by outside vendors. Each company has a team of people that they’re paying U.S. wages to normalize the data. It got expensive.
The most heavily normalized fields included birthplace, relationship, race, citizenship, marital status, gender and city, which was one of the most expensive.
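For the curious, the city standardization described above can be sketched with fuzzy string matching. This is a rough illustration using Python's standard `difflib` module, not the method any of the companies actually used. The short `KNOWN_CITIES` list, the `normalize_city` function, and the 0.6 cutoff are all assumptions; a real effort would need a full gazetteer and a lot of human judgment.

```python
import difflib

# Hypothetical reference list (an assumption); a real index would
# need a complete list of place names for each state.
KNOWN_CITIES = ["Las Vegas", "Reno", "Carson City"]

def normalize_city(raw, known=KNOWN_CITIES, cutoff=0.6):
    """Return the closest known city name, or None if nothing is close."""
    # Compare in lowercase, but return the properly cased name
    lookup = {city.lower(): city for city in known}
    matches = difflib.get_close_matches(
        raw.strip().lower(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[matches[0]] if matches else None

print(normalize_city("Las Vagas"))
print(normalize_city("xyzzy"))
```

Close misspellings are easy to catch this way. The wilder variants are another story: lower the cutoff to catch them and you start matching the wrong cities, which is why so much of this work ended up in front of paid humans.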
Search Tip: When having difficulty finding an address or location, take advantage of ED Maps and Descriptions at the National Archives. Be aware smaller towns may not be there. Think about, what was the city nearby? Where was their post office? If they had to give their address to get a package or something, what would they have said?
Skewed Forms – There are images that were digitized by NARA where the forms are skewed or offset enough that Ancestry’s “AI” system couldn’t even figure out where the form was. The AI couldn’t read them, so no one else could. It’s said there are 4.8 million records (perhaps around 192,000 images) affected. They are all being processed by outsourced vendors now. Ancestry and FamilySearch have said they expected around 1.7 million of them to be available in November 2022, with the rest to follow in January. However, past due dates have been missed.
Tip: All this means that there’s a potential that a family may be split between two pages, one of which was not digitized properly. If you’re having trouble finding someone, search for others in the family and be sure to browse the images themselves.
Tip: Get images from NARA if you want the highest-resolution TIFF files. The genealogy companies used the JPEG images.
Not at Home – The “Not at home” entries by the census takers posed a huge challenge because “not at home” was not the data expected to be read by the text recognition indexing software in that field. And enumerators often wrote it over several columns. They also used inconsistent words: Vacant, not at home.
Census Enumerator Tick Marks – Many fields had tick marks added by the enumerators. You can often see these in the Gender field. Census workers may have done this after the fact as they were counting how many males and females were in the enumeration district. The data came back from Ancestry and FamilySearch, even after FamilySearch reviewers had looked at it, with things like M ‘ (apostrophe), F ‘ (apostrophe). Everybody is having to work to fix it independently.
Names – Don’t be married to a spelling of a name. Besides human error, “AI” could have introduced errors too.
For the censuses of 1930, 1920 and 1910 there was a relatively short name field. So, if you had a person with a relatively long last name, the given names were often abbreviated. This may have also impacted the 1950 census. Most commonly, the head of household gets the abbreviation, and then the enumerator uses ditto marks for the surname, which leaves more space for the rest of the family. You’ll see the abbreviation on the head’s given name because both names have to be entered on that line.
Tip: Name Abbreviations
- Bill’s grandfather’s last name is Mansfield (9 characters long.)
- The average surname length is about 6.2 characters.
- If you’re looking for a long surname, the chance of given name abbreviation goes up.
- Try searching for abbreviated given names for the head of household.
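If you like to be systematic, the tip above can be turned into a small checklist generator. This is just a sketch: the `given_name_variants` function is hypothetical, and the abbreviation table is a small sample of traditional abbreviations I chose for illustration. It is not a complete list and not taken from any index.

```python
# A few traditional given-name abbreviations (a sample only,
# chosen for illustration; extend it for your own research).
COMMON_ABBREVIATIONS = {
    "william": "Wm", "charles": "Chas", "james": "Jas",
    "george": "Geo", "thomas": "Thos", "robert": "Robt",
}

def given_name_variants(given_name):
    """List spellings worth trying for a head of household."""
    name = given_name.strip()
    if not name:
        return []
    variants = [name]
    # Add the traditional abbreviation when the sample table has one
    abbreviation = COMMON_ABBREVIATIONS.get(name.lower())
    if abbreviation:
        variants.append(abbreviation)
    # A bare initial is another common space saver
    variants.append(name[0].upper())
    return variants

print(given_name_variants("William"))
```

Run each variant as the given name along with the full surname, and remember that the rest of the household may be indexed under a ditto mark or a differently spelled surname.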
Search Features Continue to Evolve
The available search fields at FamilySearch continue to evolve. Keyword search is now available at FamilySearch.
While they indexed the Supplemental Questions, those fields don’t appear to be fully searchable as of the time of this publication.
They don’t yet have an occupation search field. They expect to incorporate it.
Tip: Watch for FamilySearch to publish “Insights” using the aggregated data in 2023.
Insiders told me they didn’t really analyze the enumerator instructions in order to anticipate what they might see written on the census forms. In hindsight, doing so might have helped. It’s an important reminder that we, as genealogists, should take the time to learn more about the records we are using.
An Exercise in Adaptation
From FamilySearch: “The project was really a testament to adaptation and being flexible because we ran into so many issues that we just did not anticipate at the onset. And it’s a testament to the volunteers who stuck with it because we changed things on them a few times during the process.”
Bottom Line: Did Technology Help or Hinder?
Did technology introduce new problems?
- Insider Answer: “No, but yes, when there’s entries they didn’t expect.”
- FamilySearch: “Biggest problem was the number of fields that we were asking people to review.” There were fewer names on a sheet, and more pages in 1950. Linking families was a challenge.
- My opinion: Yes.
The Future of Automation
- It was probably too early.
- It will continue, but it will be slow.
- Publishers need to share more information about imperfections with consumers.
“There’s a lot of work being done by the major players looking at if we can operationalize more AI indexing, handwriting recognition, let’s call it what it is.”
“1950 is an interesting test case, where to me it proved that it was still too early. I don’t think I could win a debate with anybody that says, ‘Guys, we could have done better, faster, maybe with less money in producing 1950 than what you did with this AI and human review process.’ Especially because it was easy English.”
“Now, to say you could scale that and do records from Hungary using AI processing, it’s being looked at. Part of the hope, I think, for automation is that we’ll be able to do records from Czechoslovakia, that are hard to find, and train humans to do that work. I know FamilySearch is all about this, that if AI can get us at least a decent index for these languages. If we could do a good enough index with automation, and release it, that could be a huge game changer for the industry.
“I don’t disagree with that. But the caveat is that we then need to help educate our community about what the trade-offs of this are. Like, ‘here’s an index we made of this Finnish tax record collection. We did it with a computer. It’s going to help you, but don’t think it’s perfect. And here’s why it’s not perfect. And here’s what you should do to overcome the imperfection.’”
“I don’t think as an industry we ever do enough to tell people about the problems that we know about. We know that 4.1% of the records from 1950 aren’t with us yet. We like to gloss over stuff that the marketing folks don’t want you to say. I believe more that the users want the whole story. If you give it to them in a decent way, I think generally they’re going to understand, and they’re going to appreciate that.”
“I think the players are still going to invest quite a bit in automation, both producing content and exposing it. I think it’s still going to be a slow improvement over time. I’m not sure we’re going to see, “Oh, they finally figured it out! Next week. Oh, look, it’s finally perfect.” No, it’s going to be a slow, slow drip of improvement.”