Linux – Video Tag Editing

Not everyone may be as particular as I am about having my files organized, but I like to make sure everything is where it’s supposed to be. I make sure my music is tagged accurately. I also like to have my video files tagged correctly. What does that mean? Just like with audio files, video container formats include as part of the file some tags that provide information about the file. Those tags can include the name of the video, the year, and other information (e.g., genre, performers, etc.). If you rip files or have digital copies, it’s not really necessary to update the information in the tags. However, depending on the software you use to play your video files, having that information included in the tags substantially increases the odds that your video player will be able to figure out what the video is and will then be able to retrieve any other relevant data. Thus, having accurate metadata in your video files is nice. It’s not necessary, but nice.

I was cleaning up some video files the other day and realized that I didn’t have accurate tags in some of them. I opened the video in VLC and then clicked on “Tools” -> “Media Information”:

I wanted to see the tags in the video file.

Here’s what VLC saw:

Yep, I’m working with Frozen!

As you can see, it didn’t have any tags filled except “Encoded by.” It actually filled the title by pulling the name of the video file itself. The minimum tags that should be included in a video file are: title and year, but including genre and some of the performers is always nice.

While there are a number of music file tag editors that work very well on Linux (e.g., Picard), I have struggled to find a good video metatag editor for Linux. I had one that was working for a while, Puddletag, which actually worked quite well even though it only billed itself as a tag editor for music files. However, Puddletag does not appear to be maintained anymore and, as of Kubuntu 20.04, it is no longer in Ubuntu’s repositories and the PPA does not contain the correct release file. I could try building it from source, but I wanted to see if there was a good alternative.

After googling around, I found one that seems to work quite well – Tag Editor. (You have to love the Linux community: call the software exactly what it does!) Here’s the GitHub site. And here’s where you can download an AppImage (I went with “tageditor-latest-x86_64.AppImage”), which worked great on Kubuntu 20.04.

Once you’ve downloaded the AppImage, you can set it to be executable (right-click and select “properties” then, on the “permissions” tab, select “executable”) or just double-click it and allow it to be executed. It should load.
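If you prefer the terminal to the file manager, the same step can be sketched in a couple of lines. This is only an illustration: the helper name make_executable and the ~/Downloads location are assumptions, not anything Tag Editor requires.

```shell
# Mark a downloaded AppImage as executable for the current user.
# Usage: make_executable /path/to/tageditor-latest-x86_64.AppImage
make_executable() {
    chmod u+x -- "$1"
}

# Example (the ~/Downloads location is an assumption; adjust to your download):
#   make_executable ~/Downloads/tageditor-latest-x86_64.AppImage
#   ~/Downloads/tageditor-latest-x86_64.AppImage
```

Using u+x grants execute permission only to the file's owner, which is all that is needed to launch the program from your own account.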

In the left pane, navigate to your video file:

Once you find the file, you can see all of the tags that can be edited. Fill in the information:

Once you’ve filled in the tags you want to add or modify, click on “Save” at the bottom of the screen:

I particularly like this next feature. Once you click save, it shows the progress and actually tells you what stage it is at in saving the tags in the file:

Progress is in the circle with robust information on what it is doing next to it.

Tag Editor also does something that I actually questioned at first until it saved my bacon – it makes a backup of the file before it writes the new file. The backup file is named the same as the original file but with a new file extension: “.bak”.

You can see the backup copy of Frozen (“Frozen.m4v.bak”) just below the updated copy.

I initially thought this was just going to be annoying as I’d have to go through and delete all the backup copies once I was done. However, I did run into a couple of files that, for whatever reason, could not be modified. Partway through the tag saving process, I got an error message. Sure enough, Tag Editor, in writing the file, had stopped midway through. If a backup file hadn’t been made, I would have lost the video. I don’t know exactly what caused the errors, but I quickly learned to appreciate this feature.
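Cleaning up the “.bak” files afterwards can also be done from a terminal. The sketch below is hypothetical (both function names are my own) and assumes the backups sit next to the updated files. It only checks that the updated file exists and is not empty, which is a rough sanity check rather than proof that the video still plays.

```shell
# List leftover backup files (*.bak) under a directory so they can be
# reviewed before anything is deleted.
list_backups() {
    find "$1" -type f -name '*.bak' | sort
}

# Delete one backup, but only if the updated file it belongs to still
# exists and is not empty.
remove_backup_if_safe() {
    updated_file="${1%.bak}"
    if [ -s "$updated_file" ]; then
        rm -- "$1"
    else
        echo "keeping $1: $updated_file is missing or empty" >&2
        return 1
    fi
}

# Example (the folder name is an assumption):
#   list_backups ~/Videos
#   remove_backup_if_safe ~/Videos/Frozen.m4v.bak
```

It is still worth opening the updated video before removing its backup, since a truncated file can be non-empty.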

Just to illustrate that the tags were updated, I opened the new file in VLC and went back to the media information:

As you can see, the Title, Date, and Genre fields are now filled with accurate information.

Unlike, say, mp3 audio files, video files can take quite some time to update because the file has to be re-written. With a very fast computer, this won’t take an exorbitant amount of time. But it is a much lengthier process than updating tags in mp3 audio files.


LibreOffice 6.4.3.2 – Not Showing Greek Letters/Symbols

I ran into an issue the other day that ended up taking me hours to solve, in part because I couldn’t find any other solutions online, which is pretty unusual these days.

Here was the issue: I was evaluating a paper (I’m an academic and read lots of papers) that had a bunch of Greek letters/symbols in it as part of a regression formula. On my computer running Kubuntu 19.10 with LibreOffice 6.3, the Greek letters showed up perfectly fine. On my laptop, which I had just reformatted and on which was a fresh install of Kubuntu 20.04 with LibreOffice 6.4.3.2, the Greek letters were all showing up as something other than Greek letters – odd symbols or dingbats or something. Here’s the version number from a fresh install of Kubuntu 20.04:

And here’s what was being displayed in LibreOffice with the document:

Those familiar with the Greek alphabet will clearly see that these odd dingbats or symbols are definitely not from the Greek alphabet.

I spent about three hours googling for a solution and trying various suggestions. Google is usually a Linux user’s best friend and it’s common that someone else has had the same issue or something similar. Alas, no luck this time. No one, as far as I could tell, had run into this exact issue before. The closest problems seemed to suggest that the problem wasn’t with LibreOffice but with my Linux installation and that I was missing some language packs. Specifically, these semi-related issues suggested I needed to install a language pack with Cyrillic characters. This suggestion seemed reasonable as this version of LibreOffice didn’t seem to ship with support for Cyrillic characters:

Screenshot from LibreOffice for inserting special characters; Greek is not included by default.

I installed a Cyrillic language package from the repositories and restarted my computer. Nothing. I was still getting dingbats instead of Greek letters. I tried about 10 more Cyrillic language packages thinking that maybe I hadn’t found just the right one, searching through the repositories for anything that mentioned Greek or Cyrillic. Haphazardly adding language packages doesn’t seem like a good approach, but I was getting desperate. Even so, it didn’t help. I still couldn’t display the Greek letters in the document.

Next, I tried uninstalling and reinstalling the same version of LibreOffice – 6.4.3.2, which is the version shipping with Kubuntu 20.04. That didn’t work.

After a couple of hours and no solution, I decided that I’d try a different version of LibreOffice. On their website, LibreOffice makes two additional release candidates or development versions available. I could have gone straight to 7.0.0, which was in Alpha, but I opted instead for version 6.4.4.2. To uninstall LibreOffice, I used the following commands (see here):

sudo apt-get remove --purge 'libreoffice*'
sudo apt-get clean
sudo apt-get autoremove

To install the new version, you have to untar the files you downloaded then navigate to the DEBS folder you just unpacked, then run the following:

sudo dpkg -i *.deb
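The unpacking step can be sketched as a small helper. The function name unpack_debs, the tarball name, and the target folder are all assumptions for illustration; use whatever file you actually downloaded.

```shell
# Unpack a LibreOffice .deb tarball and print the path of its DEBS folder.
# $1 = downloaded tarball, $2 = directory to unpack into.
unpack_debs() {
    mkdir -p "$2" &&
    tar -xzf "$1" -C "$2" &&
    find "$2" -type d -name DEBS | head -n 1
}

# Example (tarball name and target folder are assumptions):
#   debs=$(unpack_debs ~/Downloads/LibreOffice_6.4.4.2_Linux_x86-64_deb.tar.gz ~/lo-install)
#   sudo dpkg -i "$debs"/*.deb
```

Printing the DEBS path saves you from having to guess the name of the versioned folder inside the archive.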

After installing LibreOffice 6.4.4.2, I opened the file that was having issues and, lo and behold, it worked just fine:

There are my lovely alphas, betas, sigmas, epsilons, and omegas!

I’m assuming this is a bug in LibreOffice 6.4.3.2 or, at a minimum, the folks who packaged that version left something out of it. Either way, I was frustrated enough at the end that I realized I needed to post a solution for others who may run into this. Since Ubuntu/Kubuntu 20.04 is an LTS (long-term support) release, having a serious bug shipping in the included version of LibreOffice is, no doubt, going to frustrate many users.

I spent a solid three hours on something that was working perfectly fine in LibreOffice 6.3 but broken in 6.4.3.2. That’s annoying. I’m a huge fan of LibreOffice and prefer it far and above MS Office. It’s mature enough software now that little regressions like this really shouldn’t happen.


Plex on Linux – Video Files Not Detected

I added a number of video files to my Plex server the other day and, when I checked Plex, some of them had not shown up in the corresponding library. I tried the obvious solutions. First, I selected the options for the library and then chose “Scan Library Files”:

That will often solve the problem, but it didn’t. Then I went for the bigger ask for my Plex server and selected “Manage Library -> Refresh all Metadata”:

That didn’t solve the problem either. Since my Plex server is on Linux, it took me a minute to think it through, but I wondered if the files I had put into the library had the appropriate permissions for Plex.

(Quick aside for those not used to Linux permissions. All files and folders on Linux have permissions assigned to them that dictate what can and cannot be done with them. Basically, it’s a 3×3 set of permissions. There are three options for what can be done with a file: Read, Write, Execute. Then there are three categories of those who can interface with the file: Owner, Group, Others. Thus, you can set the Owner to be able to Read, Write, and Execute a file while not allowing the Group or Others to do the same. Any combination of these permissions is theoretically possible.)
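To make the numbers used later in this post less mysterious, here is a small sketch that turns a three-digit mode into the familiar rwx letters. The describe_mode function is a hypothetical helper written for this illustration. The first digit is the Owner, the second the Group, and the third Others.

```shell
# Translate a three-digit octal mode such as 644 into rwx notation.
describe_mode() {
    out=""
    # Split the mode into single digits and translate each one.
    for digit in $(printf '%s' "$1" | sed 's/./& /g'); do
        case "$digit" in
            0) out="${out}---" ;;
            1) out="${out}--x" ;;
            2) out="${out}-w-" ;;
            3) out="${out}-wx" ;;
            4) out="${out}r--" ;;
            5) out="${out}r-x" ;;
            6) out="${out}rw-" ;;
            7) out="${out}rwx" ;;
            *) echo "not an octal digit: $digit" >&2; return 1 ;;
        esac
    done
    printf '%s\n' "$out"
}

describe_mode 600   # rw-------  owner can read/write; group and others get nothing
describe_mode 644   # rw-r--r--  owner can read/write; group and others can read
describe_mode 777   # rwxrwxrwx  everyone can do everything
```

Each digit is just the sum of read (4), write (2), and execute (1).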

A quick check of the files on my Plex server showed that, indeed, some of the files had permissions settings that were likely interfering with Plex’s ability to index the files:

In the image above, you can see that the permissions for the two lower files (the first and second half of the Sevilla vs CFR Cluj game) are 600 with me as the owner and not “plex.”
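The same check can be done without a file manager. The sketch below assumes GNU find and stat (standard on Ubuntu) and assumes the plex user is neither the owner nor in the group of the files, so that it falls under “Others.” The function name is made up for this example.

```shell
# Print mode, owner, and path for every file under a directory that
# "Others" cannot read.
unreadable_by_others() {
    find "$1" -type f ! -perm -o=r -exec stat -c '%a %U %n' {} +
}

# Example:
#   unreadable_by_others /plexserver/videos/newfiles
```

Any file it lists is a candidate for the kind of problem described here.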

My immediate response was to change the permissions to make them fully accessible to make sure that Plex could read them. To do that, I used the following command:

sudo chmod 777 -R /plexserver/videos/newfiles

Security gurus will freak out about this as I basically made those files accessible with no restrictions whatsoever. But, it did solve the problem with Plex. As soon as I changed the permissions, Plex was able to detect and index the files. However, that’s not best Linux security practice.

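The general rule of thumb for Linux permissions is that you should grant sufficient permissions to do what you need but never more lenient permissions than that. It turns out, Plex has a very nice article on Linux permissions. Based on that article, the media files only needed 644. One caution: folders need the execute bit before anyone can enter them, so running 644 recursively over a whole tree would lock Plex out of the folders themselves. It is safer to set files and folders separately:

sudo find /plexserver/videos/newfiles -type f -exec chmod 644 {} +
sudo find /plexserver/videos/newfiles -type d -exec chmod 755 {} +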

Once I changed the permissions, my Plex server was able to find and index the file.

Technically, what I should probably do is make the user “plex” the owner of the files, which would then allow me to keep the permissions as restrictive as possible. That is done with the “chown” command. If I did that, I could simply give the plex user ownership and then leave the permissions such that the owner can read and write to the file (so, 600). Both of these options actually work, with the first being more permissive than the second. You can see with my files that I set one to 644 with me as the owner and another at 600 with plex as the owner:

And, in this screenshot, you can see that Plex was able to recognize them both and index them since it now has “read” permissions for the file:

So, what is the best approach here? If you’re lazy, you could set all the permissions to 777. That gives Plex (and everyone else) complete access to all your media files. If you want to be somewhat restrictive, you could set the permissions for media files in Plex to 644, which gives the owner read/write access and everyone else, including the plex user if it is not the owner, read-only access. If you want to be more secure, you could set the permissions to 600 and make the plex user the owner of the file, which restricts the file to read/write for the owner and no access for everyone else.
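The most restrictive option can be sketched as follows. The give_to_owner function is a hypothetical helper. It assumes you are handing over individual files, not folders, because a folder set to 600 cannot be entered. Changing the owner to a different user normally requires root.

```shell
# Hand files to a single owner and restrict them to read/write for that owner.
# $1 = user name or numeric ID (for example "plex"); remaining arguments = files.
give_to_owner() {
    new_owner="$1"
    shift
    chown "$new_owner" -- "$@" && chmod 600 -- "$@"
}

# The same thing as plain commands (the file name is made up for illustration):
#   sudo chown plex /plexserver/videos/newfiles/match.mp4
#   sudo chmod 600 /plexserver/videos/newfiles/match.mp4
```

The trade-off is that once plex owns a file at 600, your own account can no longer read it without sudo.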


Plex – Sports Videos Library

I use Plex to manage my music and video libraries. I actually don’t watch sports very often. The one exception is when I’m grading papers (I’m a college professor). I’ve found that watching soccer/futbol matches in the background while I grade papers breaks up the monotony of grading. With finals coming up, I thought it would be a good idea to record a few games so I had about 10 hours of soccer matches to watch while I grade papers. But I also realized that I wouldn’t remember which matches I had watched, how far into them I was, and I might want to watch some on my TV and others on my computer. Thus, why not put the matches on Plex?

That’s when I realized that Plex doesn’t really have a way to organize sports events. Plex is remarkably good at pulling the relevant information for movies and TV shows (I actually use tinyMediaManager to do the renaming and organizing for movies and TV shows, then put them into the respective Plex folders, which works extremely well). But Plex doesn’t have built-in functionality for recognizing sports events. That, in itself, isn’t a problem. I just need it to recognize the files in a library and keep track of what I’ve watched and what I haven’t. So, here’s how I organized the files.

I created a new top-level folder called “sports” alongside the folders where I store my movies and TV shows:

Inside, I created a separate folder for the league I wanted to watch – English Premier League, UEFA Europa League, etc. If I was planning on keeping the games after I watched them, I would have added a folder below that for the year, but I delete the recorded games after I watch them. So, the folder structure looks like this:

I then named the files as follows: [DATE]-[TEAM1 VS TEAM2]. This structure is visible in the above image.
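If you record a lot of matches, the naming pattern can be generated instead of typed. This is only a sketch: the function name, the year-month-day date format, the date itself, and the file extension are all assumptions, since the pattern only specifies a date followed by the two teams.

```shell
# Build a file name in the [DATE]-[TEAM1 VS TEAM2] pattern.
# $1 = date, $2 = first team, $3 = second team, $4 = file extension.
match_filename() {
    printf '%s-%s vs %s.%s\n' "$1" "$2" "$3" "$4"
}

# The date and extension below are made up for illustration.
match_filename 2020-01-01 "Sevilla" "CFR Cluj" mp4
# prints: 2020-01-01-Sevilla vs CFR Cluj.mp4

# Creating the league folder from a terminal (the path is an assumption):
#   mkdir -p "/plexserver/videos/sports/UEFA Europa League"
```

Putting the year first in the date means the files sort chronologically in any file listing.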

In Plex, click on settings:

Scroll down to “MANAGE” Libraries and click on it:

Click on ADD LIBRARY

In the window that pops up, select “Other Videos.” That lets Plex know that it doesn’t need to try to retrieve metadata for the videos (this is also the option I choose for my home movies):

Name the library. I called mine “sports”:

Click “Next” then find the top-level folder you created earlier:

Once you’ve found the folder, you’re ready to click ADD LIBRARY.

This tutorial was done using Plex 4.32.1. With that version, you have the option of pinning libraries to your home screen. I went ahead and pinned sports to my home screen so I can see it quickly:

Click on the library and you should see a list of all the sports videos you have in the library:

As you can see in the image above, I had started watching one of the games when I made this tutorial. This is why using Plex to watch the games is so nice – it keeps track of where I was and I can switch between devices.

NOTE: Tutorial created using Plex version 4.32.1.


LibreOffice Calc: Graphs with Two y-axes with Different Scales

While a bit technical, it’s occasionally useful to plot multiple data series that have very different scales in the same chart. Let me give an example to illustrate. Let’s say I want to see whether the number of Mormon temples being built aligns with the number of Mormon stakes (akin to a Catholic diocese) that are organized over time. (I’m a sociologist who studies religion; you’ll just have to go with my examples.)

However, the number of Mormon temples is in the hundreds while the number of Mormon stakes is in the thousands. If I plot them both on the same chart with the same y-axis (that’s the vertical axis), the number of Mormon temples is going to look really small and I won’t be able to see the variation over time in the number of temples, like this:

The chart shows that stakes have increased, but it looks like the number of temples has barely moved. LibreOffice Calc automatically creates the scale used for the y-axis based on the scale of the larger of the two data series, in this case, the number of stakes. Thus, the maximum value is 4,000 and the minimum is 0. What I want to do in this tutorial is to illustrate how to add a second y-axis on the right side of the chart that uses a different scale that is more appropriate for the number of temples.

To begin with, go ahead and create your chart with at least two data series, as I have shown in other tutorials, like this one. Once you have your chart with two data series complete, now it’s time to add a second y-axis with a different scale.

First, click on your chart then double-click it to open chart editing. Then, select the chart area by clicking on one of the axes (left or right doesn’t matter) and then right-click it. You’ll get a context menu with the option “Insert/Delete axes…” Select that:

In the window that pops up, you’ll see a second column labeled “Secondary Axes.” You want to select “Y axis.”

Click “OK” and you’ll see that a second y-axis has been added to your chart on the right side using the same metric as the left side:

The next steps are pretty straightforward, but before you do them, you should pause and think for a second so you don’t have to go back and undo what you’re about to do. You’re going to change the scale of one of the two y-axes, but which axis do you want to change? There isn’t a right or wrong answer here. Generally speaking, I typically see charts like this with the smaller of the two ranges assigned to the left axis and the larger assigned to the right axis, but, again, it is entirely up to you which way you choose to go. At this point, though, you need to make a decision. Then you can move to the next step.

I’m going to follow my suggestion above and change the y-axis on the left to a scale that fits with the number of temples (so, a smaller range of values) and keep the y-axis on the right with the larger range for the number of stakes. But before I change the scales of the axes, I need to tell LibreOffice which data series is going to align with which axis. Here’s how. Click on one of your data series lines, then right-click it and select “Format Data Series.”

In the window that pops up, you’ll see on the “Options” tab right at the top an option that says, “Align Data Series to” and then “Primary Y axis” (this is the one on the left of the chart) or “Secondary Y axis” (this is the one on the right of the chart). Since I selected the number of temples first, I’m going to leave that one aligned to the Primary Y axis:

Hit OK. Then select the other line (in my case, the number of stakes), right-click it, and select “Format Data Series.” On the “Options” tab, I’m going to select to align this line with the “Secondary Y axis”:

Once you do that, you’ll see that LibreOffice automatically adjusts the scale of the other axis. Here’s how my chart now looks:

You can see that it changed the scale of the Primary Y-axis (the one on the left) to a maximum of 250 to reflect the smaller range of that data series. If you want to customize the scale used, you can always click on the axis you want to modify and then right-click it and select “Format Axis”:

In the window that pops up, you can modify the scale of the axis by clicking on the “Scale” tab. If you want to change the values, click on the box next to “Automatic” to unselect it so you can put in your own values, then customize the value you add, like this:

When you have modified the scale to your satisfaction, select “OK” and your graph will be updated with the scale you want, like this:

The resulting chart now has two axes with different scales. It would be a good idea at this point to label the axes to reflect the differences. Simply right-click on the chart and select “Insert Titles.” In that window, add appropriate titles. The left y-axis is simply the Axes while the right y-axis is considered the “Secondary Axes”:

And your final graph will look something like this:

That’s how you can create a chart with two axes in LibreOffice Calc.

NOTE: This example was done in LibreOffice Calc version: 6.4.2.2 on a Linux-based operating system (Kubuntu 19.10).


LibreOffice Calc: Interpolating Missing Values in Graphs

Here’s my situation. I have some data over time but I’m missing values in specific years. I want to graph that data but would rather not have to estimate all of the missing values. It turns out, LibreOffice Calc can do that for you in your chart. Here’s how…

Imagine I’m plotting the number of congregations in the LDS Church over time (weird example, I’m sure you’re thinking, but I’m a sociologist who studies religion, so, yeah, that’s what I do). I have the number of congregations in 1841, 1849, 1901, etc. Basically, I have the number in certain years, but I’m missing the number of congregations in lots of other years. I could interpolate the missing values (Excel has this function built in; LibreOffice Calc does not, but you can do it following the approach I have detailed here). But, I don’t really need to do that for my project. I just need a chart that shows the growth of congregations over time.

My data are organized into two columns. Column A is years and ranges from 1841 through 2019. Column B is the number of congregations with the values I have and lots of blank cells:

Select the cells you want to plot (A1:B176 in my case) then click on “Insert Chart”:

You’ll get this window:

Since I want a Line chart, I’m going to select “Line” and because I want “points and lines,” I’m going to select that option as well. I also want “Smooth” lines rather than “Straight” lines, so I select that option, too:

Click “Next >” at the bottom. Since you already selected your Data range, you shouldn’t have to change that. However, we do want the “First column as label” for the x-axis of the chart. So, select that option:

Then select “Next >”. You shouldn’t have to change anything on the Data Series tab, so you can hit “Next >” again. On the Chart Elements tab, you’ll want to describe your chart elements. Add a Title and label your x-axis and y-axis. I also didn’t need a legend since I’m only plotting one data series, so I turned that off:

Then click “Finish.” You’ll have a chart, but it only has the points for the years when you have data, like this:

To add a line connecting the points and interpolating the missing data, click on the chart, then double-click it to modify the chart. Once you’re inside the chart, click on one of the points to select the data series, then right-click and select “Format data series”:

On the “Options” tab you’ll see “Plot Options” and just below that, “Plot missing values.” The default is “Leave gap.” Select “Continue line” and it will interpolate the missing values for you:

Select “Ok” and your line chart will now actually have a line, like this:

There you have it. A line chart with interpolated missing values in LibreOffice Calc without you having to calculate all of the missing values.
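If you ever do need the estimated numbers themselves rather than just the drawn line, straight-line interpolation between the nearest known years can be sketched outside of Calc. The example below is an illustration under several assumptions: the data have been exported as a two-column CSV (year, value) with no header row, missing values are empty, and the function name is my own. It is not necessarily how Calc draws its line, especially with the “Smooth” option turned on.

```shell
# Fill gaps in a "year,value" CSV by linear interpolation between the
# nearest known values. Gaps before the first or after the last known
# value are left empty.
interpolate_gaps() {
    awk -F',' -v OFS=',' '
        { year[NR] = $1; val[NR] = $2 }
        END {
            last = 0
            for (i = 1; i <= NR; i++) {
                if (val[i] != "") {
                    # Fill every empty row between the previous known row and this one.
                    if (last > 0 && i - last > 1) {
                        for (j = last + 1; j < i; j++) {
                            val[j] = val[last] + (val[i] - val[last]) * (year[j] - year[last]) / (year[i] - year[last])
                        }
                    }
                    last = i
                }
            }
            for (i = 1; i <= NR; i++) print year[i], val[i]
        }
    ' "$1"
}

# Example with made-up numbers:
#   printf '1841,2\n1842,\n1843,\n1844,8\n' > congregations.csv
#   interpolate_gaps congregations.csv
```

Each missing year gets a value proportional to how far it sits between the two known years on either side.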

NOTE: This example was done in LibreOffice Calc version: 6.4.2.2 on a Linux-based operating system (Kubuntu 19.10).


Setting Up a New Windows Computer for Your Kids

I recently had a colleague contact me for some computer advice. He knows I’m a computer geek and was looking for some help setting up a new Windows laptop for his kids. He was wondering which antivirus software to buy.

If you’re at all familiar with my blog, you’ll know that I’m not a fan of Windows and run Linux almost exclusively in my house (I keep a Windows laptop around to use a book scanner). So, it may seem strange turning to a Linux user for advice for a Windows computer. But, it’s actually not that strange. Linux does so much right that it has taught me what you should do when setting up a computer, regardless of your operating system. So, here’s the advice I gave my colleague that I think would be good advice for anyone setting up a new computer for kids.

Antivirus Software

If what you want is just antivirus software, Microsoft Windows now ships with antivirus protection built in (Windows Defender, the successor to Microsoft Security Essentials). If you don’t install any other security software, just make sure it is turned on. Plus, the price is hard to beat. It’s free. Also, from my perspective, it’s the best choice for not slowing down your computer dramatically. Norton, Kaspersky, etc. all slow down your computer, which sucks. If all you want is virus/spyware protection, Windows Defender is sufficient.

Additional Software

If you want additional software on the laptop to accomplish something else, that’s a different question. Since it’s for your kids, there are two types of software you could consider.

Home Internet Security

First, do you want to restrict where your kids can go online? I’m actually a proponent of simply teaching your kids good habits and not policing where they can go. That may not be your perspective. If you want to restrict where they can go, I’d suggest OpenDNS’s Family Shield (or Home). It restricts adult content and is free.

Ransomware

Second, there is also the concern of ransomware, which is basically if someone were to get a piece of software on your computer that then locks you out of your files. The easiest solution to this is just to install backup software like Dropbox and make sure your kids store any important files in the Dropbox folder. There is a free option that gives you 2 gigabytes. It’s generally good practice to back up all of your important files anyway (e.g., essays, homework, photos, etc.). So long as you have a backup, ransomware is basically not a problem. (See this guide for dealing with ransomware by Dropbox.)

Multiple Accounts and Administrator Accounts

You should also probably set up multiple accounts on the laptop, though, again, this is up to you and how much control you want to give your kids. Setting up a standard (non-administrator) account for your kids means they won’t be able to install software without your administrative password. If they don’t know what they are doing, this is generally a good idea.

Reinstall Windows

Finally, I’m not sure how good you are with computers, but I’d also suggest making sure you have a way to revert to a completely fresh install in case your kids manage to get past the software and screw things up. Since I build my own computers and run Linux, reinstalling my operating system is something I do regularly. But for most users, the very thought of doing that is terrifying. Microsoft has made that much easier.

Conclusion

Do all of the above and your laptop should work fine for years. Plus, all of the above will cost you exactly $0, just some time.


Plex/tinyMediaManager and Doctor Who Specials

I’m a science fiction fan. And I want my science fiction at my fingertips. To that end, I have slowly been digitizing my favorite series (e.g., Star Trek, Stargate, and now Doctor Who) and putting them on my fileserver that uses Plex to serve the episodes to whatever device I want. I recently ran into an issue with how to organize Doctor Who episodes on my fileserver and, once I figured it out, I thought I’d share it here in case others run into the same situation.

First, props to Doctor Who for occasionally having specials. They are always well done and lots of fun. So, I’m certainly not complaining. BUT, the problem is that the specials are not technically part of a season. That causes some challenges when it comes to how to organize the files.

I use tinyMediaManager for organizing my movies and TV shows on my file server. Not only does tinyMediaManager pull down all the information for my movies and TV shows but it also has the ability to rename the files and organize them. I like how it does all of this and it plays nice with Plex as well, which is important.

Now enter the problem with Doctor Who. Like most TV shows, Doctor Who has seasons. The standard way to indicate the season and episode of a show is to include something like the following in the name of the digital file: “S03E14.” This indicates the show is from Season (“S”) 03 and is the 14th Episode (“E”) from that season. This works great for most of Doctor Who since most of the content is episodes.

But, what about the specials? The creators of Doctor Who regularly release off-season or out-of-season specials. These don’t get a season episode number. When I tried to get tinyMediaManager to scrape the specials, it didn’t know what to do with them. Basically, it ignored them because it didn’t know what they were since I didn’t know what to name them.

After a little googling, I found out that “Specials” have a special Season designation: “00.” So, for the first Doctor Who (2005) special, “The Christmas Invasion,” which aired on December 25, 2005, you can name the file:

Doctor Who – S00E167 – The Christmas Invasion

The naming convention breaks down like this. “S00” tells both Plex and tinyMediaManager that it is a “special.” tinyMediaManager then moves it into a folder called “Specials.” The “E167” is the episode number for the entire series (from Wikipedia). Once I figured this out, tinyMediaManager knew where to put the special:

And Plex began to recognize what the episode was:

Success! Now I can include Specials with my favorite science fiction series. There is order again in the science fiction universe.
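For anyone renaming a batch of specials, the naming convention above can be sketched as a tiny helper. The function name is made up, and plain hyphens are used as separators here, which is an assumption about what your tools expect.

```shell
# Build an episode name of the form "Show - SxxEyy - Title".
# $1 = show, $2 = season number (0 for specials), $3 = episode number, $4 = title.
# Pass the numbers without leading zeros; printf adds the padding.
special_name() {
    printf '%s - S%02dE%02d - %s\n' "$1" "$2" "$3" "$4"
}

special_name "Doctor Who" 0 167 "The Christmas Invasion"
# prints: Doctor Who - S00E167 - The Christmas Invasion
```

The same helper works for regular episodes, for example season 3, episode 14.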


R – create scatterplot with ggplot2

R has pretty amazing capabilities for creating charts and graphs. One of the most common packages for this is ggplot2. However, it’s not the most intuitive package I have used in R. So, I figured I’d illustrate how to make some relatively simple scatterplots in R using ggplot2. I’ll likely post instructions on how to create other graphs/charts in ggplot2 as well.

As with most of my R examples, I’m going to use the 2010 wave of the General Social Survey (R version here) to illustrate. You can open that file in R and follow along.

The point of a scatterplot is to visually examine the relationship between two variables. Since it’s well-known that there is a relationship between how much education someone has and how much money they make, let’s examine that relationship visually. The two variables in the GSS we’ll use are: EDUC (years of education ranging from 0 to 20, with 12 being a high school diploma and 16 a Bachelor’s degree) and REALRINC (the respondent’s income in 1986 dollars). Just because I want the resulting scatterplots to be more accurate, let’s go ahead and adjust the REALRINC into 2010 dollars first. Per this website, we need to multiply each income listed in the 2010 GSS by 1.989 to get 2010 dollars from 1986 dollars:

GSS2010$REALRINC2010 <- GSS2010$REALRINC * 1.989

Let’s just make sure the values make sense. To do that, first pull the first 10 incomes:

head(GSS2010$REALRINC, 10)

The “head” command tells R to simply list the first “10” values in the variable “REALRINC” and print them in the console. Here’s what I got as output:

[1] 42735.0 3885.0 NA NA NA 5827.5 42735.0 NA NA NA

We can check to see what the corresponding value in 2010 dollars is for the first value, $42,735, with the following command:

GSS2010$REALRINC2010[match(42735, GSS2010$REALRINC)]

The value I got was $84,999.92, which is $42,735 * 1.989. Groovy. Our formula worked.

Now to create the scatterplot. Install ggplot2 (I’m using version 3.2.1) and load the library.

install.packages("ggplot2")
library(ggplot2)

Now begins the rather bizarre part of working with ggplot2 – building the commands to create the charts.

The basic code for creating plots in ggplot2 is odd. You create a base object that includes the variables that will be used for creating the chart like this:

scatterEDUCxINCOME <- ggplot(GSS2010, aes(EDUC, REALRINC2010))

The first portion, “scatterEDUCxINCOME” is the base object.

“ggplot” calls the ggplot function from the ggplot2 package.

In the parentheses is, first, the dataset (GSS2010), followed by the “aesthetic” components of the chart (“aes” stands for aesthetic). In this case, it’s the two variables we’re going to use: EDUC on the x-axis and REALRINC2010 on the y-axis.

This base object is really just the data for the scatterplot but not the scatterplot itself. To illustrate, you can tell R to display the base object:

scatterEDUCxINCOME

You’ll get some of the pieces of a scatterplot, but not the points:

To overlay the scatterplot onto the base object, you have to tell R what objects you want to be overlaid. Obviously, that should include the points, which would be done like this:

scatterEDUCxINCOME + geom_point()

“geom_point()” tells R to overlay the geometric points for the scatterplot onto the base layer. The parentheses are useful as they allow for additional commands or modifiers for the points, as I’ll illustrate below.

Two points to note with the code above. First, when you run it, it will generate the scatterplot (see image below).

Second, you can actually save this entire piece of code into the base object, like this:

scatterEDUCxINCOME <- scatterEDUCxINCOME + geom_point()

However, doing so converts your base object into the scatterplot with the geometric points included. Since you may want to modify your scatterplot with additional features, it’s generally a good idea to not do that. Keep your base object simple and overlay things on top of it as you go until you get to where you want to be.
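To see that adding a layer creates a new object and leaves the base object alone, here is a small sketch. It assumes a made-up data frame ("toy") in place of GSS2010, and the object names are just for illustration. It counts the layers stored in each object rather than drawing anything:

```r
library(ggplot2)

# Hypothetical toy data standing in for GSS2010
toy <- data.frame(
  EDUC = c(8, 12, 12, 14, 16, 16, 18, 20),
  REALRINC2010 = c(15000, 28000, 32000, 35000, 52000, 61000, 70000, 88000)
)

# Base object: data and axes only
base <- ggplot(toy, aes(EDUC, REALRINC2010))

# Adding geom_point() returns a new object; base itself is not changed
with_points <- base + geom_point()

length(base$layers)         # number of layers in the base object
length(with_points$layers)  # number of layers once the points are added
```

Typing base or with_points at the console draws each one, so you can keep reusing base for different versions of the chart.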

As noted, the geom_point() code can be modified. One of my favorite modifications is to add a little bit of random noise (jitter) to the position of each point, which is helpful when you have a large dataset and many respondents share the same values. Otherwise, all the points are basically stacked on top of each other. Here’s how:

scatterEDUCxINCOME + geom_point(position = "jitter")
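If you want more control over the jitter, ggplot2 also has a position_jitter() function that takes width, height, and seed arguments. The sketch below is only an illustration with made-up data ("toy" is a stand-in for GSS2010, and the 0.3 and 42 are arbitrary choices of mine). It jitters the points left and right but leaves the incomes exactly where they are, and the seed makes the jitter the same every time:

```r
library(ggplot2)

# Hypothetical toy data: several people share the same education and income
toy <- data.frame(
  EDUC = rep(c(12, 16), each = 4),
  REALRINC2010 = rep(c(30000, 60000), each = 4)
)

# Jitter horizontally only (height = 0) and fix the seed for repeatability
jittered <- ggplot(toy, aes(EDUC, REALRINC2010)) +
  geom_point(position = position_jitter(width = 0.3, height = 0, seed = 42))
```

Typing jittered at the console draws the plot. Without the jitter, this toy data would show only two points, because the other six sit directly underneath them.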

Of course, a scatterplot should have labels – a title and axis labels. ggplot2 tries to add the axis labels based on the names of the variables, but you can overlay these as objects onto the base object as well, like this:

scatterEDUCxINCOME + geom_point() +
ggtitle("Scatterplot of Educational Attainment and Individual Income") +
labs(x = "Educational Attainment (in years)", y = "Individual Income")

ggtitle adds a title to the chart.

labs(x = “”, y = “”) adds the labels for the x and y axes with the labels inside the quotes.

Two additional features are helpful in a basic scatterplot. The first is a line that illustrates the relationship between the two variables. I’ll show two versions, starting with a standard regression line:

scatterEDUCxINCOME + geom_point() + geom_smooth(method = "lm", color = "red", alpha = 0.2, fill = "blue") +
ggtitle("Scatterplot of Educational Attainment and Individual Income") +
labs(x = "Educational Attainment (in years)", y = "Individual Income")

The code to add a smoothing line is geom_smooth. In this case, I modified that line to fit a linear model (method = “lm”), which draws a straight regression line, made the line red (color = “red”), and changed the fill of the confidence band around the line (fill = “blue”). The last modification was to change the transparency of that fill (alpha = 0.2).
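The straight line drawn with method = "lm" is an ordinary regression of the y variable on the x variable, so you can get the numbers behind it with R's lm function. A minimal sketch, again with a made-up data frame ("toy") rather than the real GSS2010:

```r
# Hypothetical toy data standing in for GSS2010
toy <- data.frame(
  EDUC = c(8, 12, 12, 14, 16, 16, 18, 20),
  REALRINC2010 = c(15000, 28000, 32000, 35000, 52000, 61000, 70000, 88000)
)

# Regress income on education, the same y ~ x model the red line represents
fit <- lm(REALRINC2010 ~ EDUC, data = toy)

# Intercept and slope of the line
coef(fit)

# Predicted income for a high school diploma and a Bachelor's degree
predict(fit, newdata = data.frame(EDUC = c(12, 16)))
```

The EDUC coefficient is the slope of the line: the predicted change in income for each additional year of education.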

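Other smoothing options are available. Left at its defaults, geom_smooth() fits a flexible smoothed curve rather than a straight line (LOESS for datasets with fewer than 1,000 observations, a generalized additive model for larger ones):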

scatterEDUCxINCOME + geom_point() + geom_smooth() +
ggtitle("Scatterplot of Educational Attainment and Individual Income") +
labs(x = "Educational Attainment (in years)", y = "Individual Income")

Finally, the scatterpoints can be modified in a variety of ways (e.g., thickness, transparency, color, etc.). Here’s a simple illustration:

scatterEDUCxINCOME + geom_point(position = "jitter", size = 4, colour = "green")

The points options that can be modified are detailed here.

A script file with the above commands is available here.

NOTE: This was done in R version 3.5.2.


R – find cases (rows) that match specific criteria

I regularly need to find a specific case or set of cases that meet some criteria when analyzing data, often so I can modify those values for one reason or another. The easiest way I have found to find such values in R is the “which” function.

As with most of my R examples, I’m going to use the 2010 wave of the General Social Survey (R version here) to illustrate. You can open that file in R and follow along.

In the 2010 GSS there is a variable for race (RACE). The options are: 1 = WHITE, 2 = BLACK, 3 = OTHER. To find all of the cases with a “3” in the dataset, I would use the following code:

which(GSS2010$RACE == "3", arr.ind = TRUE)

Here’s what the command is doing…

“which” is the function that tells R to search for information that meets the criteria detailed in the parentheses.

GSS2010 is the name of the dataset.

RACE is the name of the variable in the dataset. By including the name of the variable, we restrict R to searching just inside that variable rather than the whole dataset. (The $ tells R that RACE is a variable inside the GSS2010 dataset.)

The “==” indicates “equals” in R.

The target value, which can be text, characters, or numbers, goes inside the quotes. In this case, we wanted to find all of the cases with the number “3” which is code for “OTHER.”

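arr.ind = TRUE tells R to return array indices (row and column positions) when the object being searched is an array or matrix. For a single variable like RACE, which is a plain vector, it has no effect, so it can be left out.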

In the 2010 GSS, if you type in the code above, you’ll get a list like this:

 [1]    1   12   14   19   27   28   42   44   46   50   64   73   96   97  101  102  119  120  121  123  124  130  133  140  145  147  151  152
 [29]  153  154  159  161  180  185  190  194  195  199  211  213  217  220  230  245  263  275  278  287  288  295  297  301  305  312  314  318
 [57]  333  339  345  348  349  371  373  381  403  420  441  446  458  464  465  473  475  477  478  479  483  489  495  501  505  507  508  520
 [85]  550  554  561  564  567  591  593  631  704  712  713  715  716  732  741  749  770  776  792  793  805  807  823  824  829  877  901  956
[113]  973 1092 1112 1125 1186 1193 1218 1224 1264 1281 1291 1304 1307 1331 1336 1342 1345 1347 1352 1355 1356 1364 1365 1411 1423 1440 1441 1442
[141] 1444 1445 1446 1449 1451 1513 1523 1526 1528 1532 1534 1547 1550 1552 1556 1557 1559 1562 1564 1567 1568 1569 1570 1571 1572 1573 1574 1575
[169] 1576 1656 1660 1735 1764 1913 1933 1935 1993 2010 2011 2018 2019 2022 2038

The [1] is indicating that this is the first response. The [29] indicates that the next number is the 29th response. The numbers after the brackets (“[ ]”) indicate the row where that response was found. Thus, I know that the 161st row in my dataset has the value 3 in the variable RACE.
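If you want to try "which" on something small enough to check by eye, here is a quick sketch with made-up values (the "race" vector and the matrix "m" are hypothetical, not from the GSS). It also shows the one situation where arr.ind makes a difference:

```r
# Hypothetical toy vector with one missing value
race <- c(1, 3, 2, 3, NA, 3)

# Positions where the value equals 3; the NA is simply skipped
hits <- which(race == 3)
hits

# For a plain vector, arr.ind = TRUE gives the same answer
same <- which(race == 3, arr.ind = TRUE)
identical(hits, same)

# For a matrix, arr.ind = TRUE returns row and column positions instead
m <- matrix(c(1, 3, 2, 3), nrow = 2)
pos <- which(m == 3, arr.ind = TRUE)
pos
```

The 3s in the toy vector sit in positions 2, 4, and 6, which is what "which" reports.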

We can check this by using the following code:

GSS2010[161,"RACE"]

The result should be:

[1] 3

BONUS:

Should you want to modify the value for an individual observation, like the one we just examined, you could use the following code:

GSS2010[161,"RACE"] <- 2

This would change the value for that case from “3” (OTHER) to “2” (BLACK). I’m not really sure why you would want to do this in this instance, but now you can. (There is a scenario when you might, but there are better ways to recode data.) Basically, the “<-” tells R to set the value of that specific observation to 2, overwriting the 3 that was there.

And if you wanted to change all of the values from 3 to 2, since you have a massive list, the easiest way would be to save all those values as a list, then have R change all the values in one fell swoop, like this:

OTHERRACELIST <- which(GSS2010$RACE == "3", arr.ind = TRUE)
GSS2010[c(OTHERRACELIST), "RACE"] <- 2

The above two commands would create a list in your environment called “OTHERRACELIST” that includes all of the row numbers of the cases with a 3. The second command then tells R to look inside the GSS2010 dataset and use the list (c(OTHERRACELIST)) to find all the rows you want changed in the RACE variable to “2.” That will then change the code for all of the people coded as “3” into a “2.”
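Here is the same two-step recode on a tiny made-up data frame ("toy" and "rows" are hypothetical names, not part of the GSS), so you can see exactly which values change:

```r
# Hypothetical toy data standing in for GSS2010
toy <- data.frame(RACE = c(1, 3, 2, 3, NA, 3))

# Step 1: save the row numbers of every case coded 3
rows <- which(toy$RACE == 3)

# Step 2: overwrite RACE in just those rows
toy[rows, "RACE"] <- 2

# Rows 2, 4, and 6 are now 2; the 1 and the NA are untouched
toy$RACE

# A quick count to confirm that no 3s remain
table(toy$RACE, useNA = "ifany")
```

Running a table on the variable after a recode like this is a cheap way to confirm that the old code is gone and nothing else moved.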

A script file with the above commands is available here.

NOTE: This was done in R version 3.5.2.
