Project organization and management#
Most of the project organization material can be found at https://software-carpentry.org and http://www.datacarpentry.org
Many thanks to them for existing!
Structure or architecture of a data science project#
Some good practices for organising your project directory on a server, in the cloud, or on any other machine where you will compute:
Create three or four directories within your project directory (use `mkdir`):

- `data/` for keeping the raw data
- `results/` for all the outputs of the analyses you will perform
- `docs/` for all the notes written about the analyses carried out (e.g. `history > 20221114.logs` to record the commands executed today)
- `scripts/` for all the scripts you will use to produce the results
Note
You should always keep the raw data in (at least) one place and never modify it.
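If your shell supports brace expansion (bash and zsh both do), the whole layout can be created in a single command. This is just a sketch; `my_project` is a placeholder name:

```bash
# create the project skeleton in one go using brace expansion
mkdir -p my_project/{data,results,docs,scripts}
ls my_project
```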
More about data structure and metadata#
- Direct link to the tutorial used for the lesson: Shell genomics: project organisation
- Good practice for structuring the data and metadata of a genomics project: Organisation of genomics project
- Some extra material: Spreadsheet ecology lesson
Exercise#
This exercise combines the knowledge you have acquired during the unix and project organisation lessons.
You have designed an experiment where you are studying the species and weight of animals caught in plots in a study area. The data were collected by a third party and deposited in figshare, a public repository.
Our goal is to download and explore the data while keeping an organised project directory.
Set up#
First we go to our working directory for this training and create a project directory:

```bash
cd ~/bioinfo_training
mkdir animals
cd animals
```
As we saw during the project organization tutorial, it is good practice to separate data, results and scripts. Let us create those three directories:

```bash
mkdir data results scripts
```
Downloading the data#
First we go to our `data` directory:

```bash
cd data
```
then we download our data file and give it a more appropriate name:

```bash
wget https://ndownloader.figshare.com/files/2292169
mv 2292169 data_joined.csv
```
Since we'll never modify our raw data file (or at least we do not want to!), it is safer to remove its write permission:

```bash
chmod -w data_joined.csv
ls -l
```
Note
What if my data is really big? Data that is several gigabytes large will usually come compressed. You learnt about compression during the installing software lesson.
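As a sketch of how that looks in practice (the filename here is hypothetical), a gzip-compressed file can be inspected without ever writing the decompressed copy to disk:

```bash
# peek at a compressed file without decompressing it on disk
zcat big_data.csv.gz | head

# count its lines on the fly (gunzip -c works the same way as zcat)
zcat big_data.csv.gz | wc -l
```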
Let us look at the first few lines of our file:

```bash
cd ..
head data/data_joined.csv
```
Our data file is a `.csv` file, that is, a file where fields are separated by commas (`,`).
Each row represents an animal that was caught in a plot, and each column contains information about that animal.
Question
How many animals do we have?
```bash
wc -l data/data_joined.csv
# 34787 data/data_joined.csv
```
It seems that our file contains 34787 lines. The first line is the header, and each remaining line is an animal, so we caught a grand total of 34786 animals over the course of our study.
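Since `wc -l` counts every line, header included, one way to count only the data rows is to strip the header with `tail -n +2` before counting:

```bash
# drop the header line, then count the remaining (animal) lines
tail -n +2 data/data_joined.csv | wc -l
```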
Our first analysis script#
We saw from the `head` command that the first 10 records were all rodents.
Question
Is rodent the only taxon that we have captured?
In our csv file, we can see that "taxa" is the 12th column.
We can print only that column using the `cut` command:

```bash
cut -d ',' -f 12 data/data_joined.csv | head
```
We still pipe into `head` because we do not want to print 34787 lines to our screen.
Additionally, `head` makes us notice that the column header is still printed out.
```bash
cut -d ',' -f 12 data/data_joined.csv | tail -n +2 | uniq -c
```
But while `uniq` is supposed to count all occurrences of a word, it only counts identical adjacent lines.
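A tiny demonstration of this behaviour, independent of our dataset:

```bash
# uniq only merges adjacent duplicates: the two "bird" lines are
# separated by "rodent", so each is counted once
printf 'bird\nrodent\nbird\n' | uniq -c

# sorting first groups identical lines together, so uniq -c
# now reports bird twice and rodent once
printf 'bird\nrodent\nbird\n' | sort | uniq -c
```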
Before counting, we need to sort our input:
```bash
cut -d ',' -f 12 data/data_joined.csv | tail -n +2 | sort | uniq -c
```
We see that although the vast majority of our catches were rodents, we also caught reptiles, birds and rabbits!
Now that we have a working one-liner, let us put it into a script:

```bash
nano scripts/taxa_count.sh
```
and write:

```bash
#!/bin/bash
# script that prints the count of taxa for csv files
cut -d ',' -f 12 "$1" | tail -n +2 | sort | uniq -c
```
Saving the result#
```bash
bash scripts/taxa_count.sh data/data_joined.csv > results/taxa_count.txt
cat results/taxa_count.txt
```
Improving our script#
We would also like to know the distribution of the number of animals caught in plots each year. The year is the 4th column in our dataset, and our script, in its current state, always selects the 12th column of a file.
We can change our script to make it flexible, so that users can choose which column they wish to work on.
```bash
nano scripts/taxa_count.sh
```

and change its content to:

```bash
#!/bin/bash
# script that prints the count of occurrences in one column of a csv file
cut -d ',' -f "$2" "$1" | tail -n +2 | sort | uniq -c
```
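One optional refinement (our suggestion, not part of the lesson) is to give the column a default value, so that calls without a second argument keep counting taxa:

```bash
#!/bin/bash
# script that prints the count of occurrences in one column of a csv file;
# the column defaults to 12 (the taxa column) when no second argument is given
col="${2:-12}"
cut -d ',' -f "$col" "$1" | tail -n +2 | sort | uniq -c
```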
Now that it no longer counts only taxa, it doesn't make much sense to keep it named `taxa_count.sh`, so let us rename it:

```bash
mv scripts/taxa_count.sh scripts/column_count.sh
```
Question
Which year did we catch the most animals? Try to answer programmatically.
Question
Save the sorted output to a file in the `results` directory.
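One possible approach to the first question: count the animals per year (column 4) with `scripts/column_count.sh`, then sort the counts numerically in descending order and keep only the top line.

```bash
# animals per year, biggest count first, keep the winning year
bash scripts/column_count.sh data/data_joined.csv 4 | sort -nr | head -n 1
```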
Investigating further#
We'd like to refine our animal count by finding out how many animals of each taxon were captured every year.
We can use `cut` on several columns like this:

```bash
cut -d ',' -f 4,12 "data/data_joined.csv" | tail -n +2 | sort | uniq -c
```
Now that we are happy with our one-liner, let us save it in a script:
```bash
nano scripts/taxa_per_year.sh
```

then save the output to `results/taxa_per_year.txt`:

```bash
bash scripts/taxa_per_year.sh > results/taxa_per_year.txt
```
Question
Which year was the first reptile captured?
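One way to tackle this (assuming the taxon appears in the file exactly as `Reptile`; check the spelling in your own output first): since the `sort | uniq -c` pipeline already ordered the lines by year, the first matching line is the earliest capture.

```bash
# keep only the reptile lines; the first one is the earliest year
grep 'Reptile' results/taxa_per_year.txt | head -n 1
```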
The next step is to refine our analysis by year: we will save one output file per year.
The seq command#
To perform what we want to do, we need to be able to loop over the years.
The `seq` command can help us with that.
First we try:

```bash
seq 1 10
```
then:

```bash
seq 1997 2002
```
and what about the span of years we are interested in?

```bash
seq 1977 2002
```
Great! So now, does it work in a for loop?

```bash
for year in $(seq 1977 2002)
do
    echo $year
done
```
It does!
Before doing our analysis on each year, we still have to figure out how to do it on one year.
```bash
grep 1998 results/taxa_per_year.txt
```
"Grepping" the year seems to work. Now we need to save the output into a file whose name contains the year.
First, let's create a directory in which to store our results:

```bash
mkdir results/years
```
and we try to redirect our yearly count into a file:

```bash
grep 1998 results/taxa_per_year.txt > results/years/1998-count.txt
cat results/years/1998-count.txt
```
It seems to have worked. Now with the loop:

```bash
for year in $(seq 1977 2002)
do
    grep $year results/taxa_per_year.txt > results/years/$year-count.txt
done
ls results/years
```
Question
Put your loop in a script
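A minimal sketch of such a script (the filename `split_years.sh` is our choice, not the lesson's):

```bash
#!/bin/bash
# split the per-year taxon counts into one file per year
mkdir -p results/years
for year in $(seq 1977 2002)
do
    grep "$year" results/taxa_per_year.txt > "results/years/$year-count.txt"
done
```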