Welcome to the discussion! We thought we'd start this conversation by talking about the data behind the graphics. In this discussion, we're asking you to share your experience and advice in accessing, collecting and using data. We also encourage you to ask questions to this experienced group of conversation leaders! Consider these questions below when sharing your comments in this discussion topic:
- Where to find existing data
- How to collect new data
- How to clean and analyze data
- Share examples, guides and other resources that would be helpful for defenders in understanding data.
Share your thoughts, experiences, questions, challenges and ideas by replying to the comments below.
For help on how to participate in this conversation, please visit these online instructions. New feature: you can now add images and video to your comments!
Hi all, very happy to be taking part in this conversation!
Just to kick things off: as part of our project School of Data (http://schoolofdata.org/ ) we have collected some tools to assist with data cleaning: http://schoolofdata.org/online-resources/
and we have a couple of 'recipes' which provide more assistance on how to use the tools, kindly created by Tactical Tech! http://schoolofdata.org/courses/#IntroDataCleaning
We welcome feedback on any of the above, or if there are tools that we have forgotten, let us know!
Thanks for sharing, Zara! School of Data certainly has some great resources for advocates new to the data-cleaning world.
I was wondering if you or the School of Data have any words of wisdom around data cleaning. Are there any common mistakes or pitfalls that human rights advocates run into when cleaning their data? Is there anything about the data used in human rights campaigns that makes cleaning it different from other types of data cleaning? It would be great to learn more about these lessons, and to hear stories of the challenge (and glory!) of cleaning up data for human rights campaigns. Thanks!
-- Kristin Antin, New Tactics Online Community Builder
I just wanted to expand a bit on what we mean when we talk about "cleaning" data. In the simplest terms we are talking about reducing errors in a dataset. Often data is collected in different formats by different people in different locations. Errors are bound to occur.
When we say cleaning we are referring to finding and removing unwanted bits in datasets; for example, an extra zero in a spreadsheet cell can make a huge difference to a budget. We are also talking about formatting data correctly for the tools that you are using, and dealing with inconsistencies in the data (for example, making sure all dates are recorded in the same way). We also mean structuring the data so it can be used effectively for what you want it to do and to help you analyse it.
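To make this concrete, here is a minimal Python sketch of the kind of date-consistency pass described above. The raw values and the list of accepted formats are hypothetical; in practice you would build the format list from what you actually see in your dataset.

```python
from datetime import datetime

# Hypothetical raw values, as they might arrive from different collectors.
raw_dates = ["03/07/2013", "2013-07-03", "3 July 2013"]

# Formats we expect to encounter; anything else is flagged for manual review.
KNOWN_FORMATS = ["%d/%m/%Y", "%Y-%m-%d", "%d %B %Y"]

def normalize_date(value):
    """Return the date in ISO format (YYYY-MM-DD), or None if unrecognized."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # leave for a human to inspect rather than guessing

print([normalize_date(d) for d in raw_dates])
# → ['2013-07-03', '2013-07-03', '2013-07-03']
```

Returning `None` for unrecognized values, rather than guessing, keeps the cleaning step from silently introducing new errors.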
Open Refine is a common tool used for cleaning datasets, and spreadsheets can do a decent job too. As Zara mentioned check out School of Data for more... http://schoolofdata.org/courses/
Thanks, Emma. I've enjoyed exploring the School of Data courses! Very useful.
There's one in particular that addresses my question above about common pitfalls in interpreting data: http://schoolofdata.org/handbook/courses/common-misconceptions/ This module highlights a few common mistakes, like using the 'average' in an inappropriate way, skewing the size of the visuals that you use to represent data, using inconsistent timelines, and falling into the correlation-versus-causation trap.
If anyone is brave enough to share a real-life example of when you have fallen into one of these traps - please share!
- Kristin Antin, New Tactics Online Community Builder
As a Crisismapper and open community leader, there are a few key things that should be done before even starting your data scramble. I like to call these the 5 T's: Trust, Time, Team, Training, and Tenacity. Simply put, you really need a plan before you start looking for data. These pillars will guide you as you dig in.
Start with a list of master keywords, then start searching: country open data portals (e.g. UK data), the Open Data Index, World Bank, UN data and so on. Also look for topical data, like the International Aid Transparency Initiative (IATI) registry, etc. Don't discount the power of a straight eyeball Google search. Once you get digging, you might find the right 'rabbit hole'.
There is no one data catalogue or master data search for all open data information. For the Standby Task Force and Crisismapper activities, we usually start with a Google Doc shared widely to find out which sources are available. By collecting these datasets in the open, you can draw on your networks and, hopefully, other people's extended networks.
There are whole courses (e.g. Coursera, TechChange, School of Data) on how to get data clean and analyzed. With the Kenyan elections (Uchaguzi), we had a few teams organized. There are clear needs: you need to have data categorized and a list of what constitutes a 'clean data item'. Then, engage folks with research and technical skills to help clean and analyze the data. There are some key tools that are helpful: R, Sublime Text, Excel, etc. But data cleaning is really (again) about rolling up your sleeves and reviewing, analyzing and making decisions about the data. Remember - your assumptions and biases could change the shape of the data. Be clear on your goals and the limitations of your project. This is why I think a strong plan is so key before you throw maps or other data viz at a concept or idea. This is especially important when you are dealing with complex issues like human rights.
Here are the data cleaning guidelines that I created along with friends from GroundTruth, OKFN, and Datakind. Please use and remix.
Some posts I've written:
Serving @ Data (includes a list of resources)
Cameras as Evidence - some guidelines on video and image data
You Can't just throw a map at it
Data, So what -(slideshare)
Happy Monday. Looking forward to your thoughts.
When collecting data, it's always good to think with this framework in mind: What, Where, When, and Who. It's also good practice to understand that data collection requires a lot of resources, and you have to adjust your data collection tool, or your mindset, depending on the situation. For example, if you're trying to get information during a disaster (typhoon, pandemic, earthquake, etc.), then you need to keep it simple; there is always the chance to go back and collect more detailed information. In these situations, open fields are better than structured fields; however, you could also use tags (these can evolve or change over time) to better index the data.

Location is important. Latitude and longitude are great if you can get them, but most reporters may not be equipped with a GPS or smart device (it also depends on the network), so consider rolling up to the nearest resolution (address, neighborhood, county, city, region, or even country). Make sure it is clear you are differentiating between the reporter's location and the event's location. For example, a reporter with the BBC in London reporting on an event in the Philippines involves two different locations.

Also allow for optional information, such as a contact person for the reporter (important especially in crisis situations, so you can solicit additional information or ask for clarity on information submitted in order to verify it). Please note that you have to ensure private information is protected at ALL times, and always ask for consent on whether the contact information provided to you is OK to store or share. DO NOT ASSUME that you can share anyone's contact information unless they clearly state that. Even then, take extra precaution to assess the risk it will pose to the reporter (imagine someone's information being shared with you during wartime; this information could potentially harm or kill the individual if you share it with others and it gets into the wrong hands).
If you can't encrypt the information in transit (using secure tunnels, like HTTPS, when the information is submitted to you) and at rest (inside the database), then consider deleting this information. Other optional information includes videos, audio clips, pictures, links to stories or additional documents, etc. When you capture the time, make sure it's consistent. At minimum, make sure you get the day, month and year, and give clear instructions on the format you need it in (it's nice to have a calendar pop-up or drop-down list to make it easier for reporters). Month and day can be mixed up if the format is not clear; for example, 01/02/2013 can mean January 2nd or February 1st depending on the format you choose.
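The date ambiguity described above is easy to demonstrate: the very same string parses to two different dates depending on which convention you declare. A quick Python illustration:

```python
from datetime import datetime

value = "01/02/2013"

# The same string yields two different dates depending on the format chosen:
us = datetime.strptime(value, "%m/%d/%Y").date()    # month-first convention
intl = datetime.strptime(value, "%d/%m/%Y").date()  # day-first convention

print(us.isoformat())    # → 2013-01-02 (January 2nd)
print(intl.isoformat())  # → 2013-02-01 (February 1st)
```

This is why stating (or enforcing, via a date picker) a single expected format up front saves a lot of cleaning later.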
To the list of data analysis tools that Heather mentioned above, we would like to share with you also Tableau Software, they have a public version available for free. Please take a look at how we analyze our killings data in Syria Tracker (project of Humanitarian Tracker) using this free version here.
Thank you, Hend, for all this great advice! There's so much here to think about. I wanted to ask you (and all the other participants) about best practices for securing the data you have. Let's say you are pulling together data on human rights violations in a certain location over a specific period of time. You have a number of sources that you are collecting the data from, and then you will organize it, analyze it, and then move forward with planning for creating visualizations. The data you are collecting is sensitive - it may include information on the victims, survivors, perpetrators, locations, dates, violations, etc. How do you keep this data secure so that it doesn't fall into the hands of a person who would use this information to harm someone?
Hend, you already mentioned that it's important to encrypt the information in transit and at rest. It would be great to learn about the tools that are available for this, as well as any other points along the data collecting/organizing/analysis process where we need to consider the security of the information.
I also wanted to share a resource regarding 'informed consent'. Collecting information via interviews could be a week-long discussion in and of itself! But if you are interested to learn more about informed consent you could review WITNESS's Conducting Safe, Effective and Ethical Interviews with Survivors of Sexual and Gender-based Violence (thank you Bryan Nunez for mentioning this guide in our October conversation). In that guide (on page 11) there is a section on informed consent which states:
Obtaining your interviewee’s informed consent is essential before you proceed with the interview. Informed consent is the interviewee’s agreement to be filmed and can only be provided after they understand how the video will be used and who will see it. Ask your interviewee what the worst-case scenario for them might be (such as their perpetrator or community seeing the video and recognizing them) and share potential strategies for mitigating these risks (such as concealing their identity while filming, during the edit or using a pseudonym). It is especially important to make clear that if this video goes online, anyone may be able to see it now or in the future – and the reach of the video may be amplified through social media. If the incident is in relation to a criminal case, depending on the jurisdiction, the footage could be subpoenaed – check with a legal professional if this is the case.
In a previous conversation on social media, Chris Koettl of Amnesty International USA recommended the second edition of the Professional Standards for Protection work, published by the International Committee of the Red Cross, that includes information on informed consent.
Any other resources out there on informed consent in the context of collecting human rights information?
-- Kristin Antin, New Tactics Online Community Builder
Quick response to Kristin's request for tools to secure and encrypt information. Security in-a-box (developed by Tactical Tech and Frontline Defenders) is a great resource for walking you through tools to help you do exactly this.
There is a chapter on 'How to protect sensitive files on your computer': https://securityinabox.org/en/chapter-4
and one on 'Protecting your information from physical intruders': https://securityinabox.org/en/chapter_2
Hope that helps!
Security In A Box is a great resource indeed. To build on top of Emma's and Kristin's comments, I'll add that there is a difference between protecting data that you have obtained from third parties or secondary sources and protecting data during collection, especially when the collection is done on our behalf by people in the field. If you will be collecting anything that can be used to identify human rights subjects, currently or over time, you need to consider your responsibility to protect it. End-to-end encryption, using open source tools, is the way I'd suggest doing it.
In terms of tools for secure collection, Benetech has been working on Martus for that purpose for over 10 years. Beyond that, there is a growing number of tools that could be customized for collection purposes, now available thanks to the attention to digital security in the journalism field; SecureDrop and GlobaLeaks are good examples of that trend.
I'm particularly interested in the increased use of mobile for data collection. We actually released Mobile Martus recently, and we are working with the team at the Guardian Project to improve it. I'd love to hear about other secure mobile tools from people on this thread.
Data at Rest – For mobile devices, a best practice is to not store/persist any sensitive data locally on the device (if possible). This can be a requirement for custom-developed crisis-tracking applications, and should be an evaluation consideration for commercial/third-party applications as well. Regardless, cached and temporary data should be cleared upon closing the application, when switching contexts, or at specified time intervals. Encryption should always be used, whether integrated into the OS at the hardware level (e.g., 256-bit AES hardware encryption for newer iOS devices) or implemented separately by applications via FIPS-compliant software algorithms. In the end, the most effective way of mitigating the impact of a mobile device security breach is to ensure no sensitive information persists on the device. For performance reasons (e.g., today’s emails) some local storage of sensitive data is often required – in these cases, hardware-level encryption and clearing of caches is critical.
Data in Transit – For mobile, all network traffic (WiFi, 3G/4G, etc.) should be encrypted - at several layers. All traffic should be encrypted from the device out through TLS connections. WiFi traffic should be encrypted again using a WPA2 encryption tunnel (cellular data protocols may not enable WPA2 encryption tunneling, e.g. AT&T 3G). Use a device-level VPN configuration to add an additional layer of encryption via a FIPS-certified encryption tunnel.
Cloud-based file sharing services – Be cognizant of the risks associated with cloud-based file sharing services (e.g., Dropbox, Box.net, Google Drive, etc.). They are increasingly popular; however, many common use cases for these services open themselves up to security/privacy risks (e.g., inadvertently sharing a link to proprietary data, login credentials retained on a mobile device, etc.).
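The "encrypt from the device out through TLS" point above can be illustrated with Python's standard-library `ssl` module. The snippet below is a minimal sketch, not a complete security measure; `is_encrypted_endpoint` is a hypothetical helper showing the kind of check a collection app might make before submitting a report.

```python
import ssl

# A default context verifies server certificates and hostnames - the
# baseline you want before any report leaves the device.
ctx = ssl.create_default_context()
assert ctx.verify_mode == ssl.CERT_REQUIRED
assert ctx.check_hostname is True

def is_encrypted_endpoint(url):
    """Refuse to fall back to plain HTTP: check the scheme before submitting."""
    return url.lower().startswith("https://")

print(is_encrypted_endpoint("https://example.org/report"))  # → True
print(is_encrypted_endpoint("http://example.org/report"))   # → False
```

A real application would additionally pin certificates or route traffic through a VPN, as described above; this sketch only shows the TLS baseline.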
Online security and privacy – For crowdsourced efforts, like ours with Syria Tracker (a project of Humanitarian Tracker), we dedicated an Instructions page for reporters here. We encourage you to review the content there (and we welcome any updates as well). Additionally, we have shared this guide with our reporters on how to protect their security and privacy online (Arabic version - English version).
I hope I'm not too late to participate in this dialogue, which I find really useful. I am working with both quantitative and qualitative data on human rights violations, and one reason is that I am seeking an understanding of why certain violations occurred, and what role diverse factors played in both generating the violations, and in the eventual decline of those violations. The what, when, who, and where questions are critical to this quest for understanding why, or for measuring causal factors. By themselves, the answers to these questions may suggest correlation, but not cause, while interviews with different actors in the situation suggest a lot more about the reasons why this violence occurred. These subjective accounts are often also contradictory and self-interested, but often also come from people who were there. And as others have noted, there are security issues regarding what sensitive human rights information can be cited from an interview, and how.
I am interested in how others use data - statistical, visual, narrative - to investigate the causes of human rights violence, and how you work with a combination of quantitative and qualitative information in that inquiry. Is there a process parallel to cleaning statistical data that addresses interview data? Maybe a series of questions for interrogating what is said?
As a person who primarily focuses on mapping and satellite imagery analysis, the WHERE of a situation is often the first and most important question to answer. We get requests from human rights organizations that work in (often) quite obscure areas of the world, so our first task is to figure out specifically WHERE an event took place so that we can target satellite imagery. This can be a very difficult first step for a variety of reasons.
One of our first steps is pretty obvious: Google (or Google Maps/Google Earth). If that doesn't work, we often go to GeoNames or the GEOnet Names Server. Both of these are databases of place name information to help guide us on our way. The GEOnet Names Server is particularly good for obscure locations within a country - we have found locations of tiny villages in this database in the past.
If these databases let us down, we will sometimes make a gridded map to send to our partners to have them mark the location of interest on the map. In the past, this was just a PDF that we sent and they might respond with, "Oh, it's in square C14." Now, there are more tools that we can use for marking places, such as GeoPDF or ArcGIS Online, that make finding a place much easier than it was for us not that long ago!
Whoa - I had no idea that documenting a location could be so complicated! Thanks for sharing these challenges. Very interesting!
I'm curious about what happens after a partner has indicated that an event happened in 'square C14'. If there isn't a name for that place that makes sense to a wider audience (not exactly sure what I mean by this - just that if I told you something happened in xxnnl you wouldn't actually know where that is), but it's important to capture the precise location data... is the next step to document the latitude and longitude of that place? Are coordinates a widely used location data variable? If you're using coordinates in your data collection, is that a useful type of data for visualization efforts?
I think I'm still trying to wrap my mind around this! :)
-- Kristin Antin, New Tactics Online Community Builder
Once we have "It's in C14" we can determine the latitude/longitude for the area and order a satellite image from that information. Satellite imagery orders cover a minimum of 25 sq km of area, so we don't have to be horribly precise at that phase! Once we get the image, we can scan the area to find the precise location of the feature of interest.
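For readers curious how "It's in C14" becomes an approximate latitude/longitude, here is a hypothetical Python sketch. The bounding box, grid dimensions, and lettering convention (lettered rows counting down from north, numbered columns running east) are all assumptions for illustration; any real gridded map would define its own.

```python
def cell_center(cell, north, south, west, east, rows, cols):
    """Approximate lat/lon of the center of a grid cell like 'C14'."""
    row = ord(cell[0].upper()) - ord("A")  # 'C' -> row index 2
    col = int(cell[1:]) - 1                # '14' -> column index 13
    cell_h = (north - south) / rows
    cell_w = (east - west) / cols
    lat = north - (row + 0.5) * cell_h     # rows count down from north
    lon = west + (col + 0.5) * cell_w
    return round(lat, 4), round(lon, 4)

# Toy example: a 20x20 grid over a one-degree square.
print(cell_center("C14", north=5.0, south=4.0, west=6.5, east=7.5,
                  rows=20, cols=20))
# → (4.875, 7.175)
```

A point near the cell's center is plenty precise for ordering imagery, given the 25 sq km minimum order size mentioned above.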
Coordinates are so important for any type of geographic visualization! They help with being able to properly identify and symbolize a large number of points, rather than needing to draw things individually. I've attached a 'Before and After' of the demolition of the Njamenze area of Port Harcourt, Nigeria. The first image shows the area intact, while the second shows it completely demolished. When we did the assessment (third image), we recorded the coordinates for every building in the area, so we were able to quantify and symbolize things very easily. This is just a basic example, but just imagine drawing 375 red dots and then needing to change the colors to show different types of structures! That would be a lot of Photoshop work!
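As a toy illustration of why coordinates plus a category column make this so easy: once each building is a record, tallying or recoloring by structure type is a single pass over the data. The records below are made up for illustration.

```python
from collections import Counter

# Hypothetical assessment records: (lat, lon, structure type) per building.
buildings = [
    (4.7774, 7.0134, "residential"),
    (4.7776, 7.0139, "residential"),
    (4.7779, 7.0131, "commercial"),
    (4.7781, 7.0140, "residential"),
]

# With coordinates attached, counting and symbolizing by category is one
# pass, instead of recoloring hundreds of dots by hand.
counts = Counter(kind for _, _, kind in buildings)
print(counts["residential"], counts["commercial"])  # → 3 1
```

A mapping tool would then assign one symbol per category; the point is that the data, not the drawing, carries the classification.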
Coordinates are also really important in social media analysis - a lot of what is volunteered these days has some sort of location associated with it, whether it is specific GPS coordinates from a smartphone or even just your home city listing on Twitter. This is helping researchers learn so much more than they could without this associated geographic content. As a funny example, my friends over at Floating Sheep made a map of "Church vs Beer" which mapped geotagged tweets that related to either church or beer at the US county level! Can you imagine coming up with a map like that in any other way?
Ok, I see. Very interesting...
So, if I wanted to share GPS coordinates with you via my smart phone (or other phone?), how could I do that?
-- Kristin Antin, New Tactics Online Community Builder
I'm glad to be part of this dialogue and I'm looking forward to learning from and engaging with all of you. Before I dive into more technical aspects of cleaning and collecting data, I'd like to share some general considerations that I have found useful when visualizing data for human rights in projects like A Costly Move: Far and Frequent Transfers Impede Hearings for Immigrant Detainees in the United States and Troops in Contact: Airstrikes and Civilian Deaths in Afghanistan.
Thanks, Enrique! Lots of great points.
One point that grabbed me was: in general, I'd suggest keeping a clear set of goals from the beginning, strong enough to be used as a constant reference yet sufficiently flexible to allow for the integration of what we discover during the exploratory data-dives.
I wonder if you and the other participants have ever started a data analysis project that had outcomes you weren't expecting, that impacted the strategy for your campaign. I'm curious to hear your examples of this kind of situation, if they exist. Thanks!
-- Kristin Antin, New Tactics Online Community Builder
Following this is interesting yet not always easy - I'm a CG artist and long-time activist, but not having worked in the types of campaigns most of you focus upon, this is more education for me than implementation or directly applicable processes. It's just enjoyable to see a great New Tactics dialogue in process.
In that sense, I thought of a page I ran across recently re: social network analysis.
Apologies if it's not very relevant, but I thought I'd pass it along; it's likely to be the only bit I can contribute to this discussion. It comes from the ICTWorks.org / ICT4D community and the IREX folks (I try to keep up on their news also - info advocacy, etc.), a post by Wayan Vota. I hope it can be useful to some of you.
(Note: in trying to ascertain who their experts are, I'll mention a few of the folks they've got in their conference this week that seem to dovetail with the advocacy and other connected work New Tactics is committed to - I noted social justice advocate Rohan Grover, and Behar Xharra, a Kosovo diaspora writer involved in public diplomacy in online realms.)
When thinking about a situation, whether current or in the past, we all have certain expectations. But approaching data collection with a presumption can blind you to surprises.
Here are three tips on how to avoid overlooking the surprises that might be hidden in the data:
Do not presume
Try to let the data speak for itself; be ready to be surprised yourself. A good approach to unfamiliar data, described in one article, is to "circle, dive and riff".
Visualize to see patterns
Data tables can only go so far. Tables are good for looking up specific values ("amount of arms sold from one country to another in 2003"), but they can be misleading when you look for patterns. The one example to know here is Anscombe's quartet, based on a paper by the British statistician Francis Anscombe in the early 1970s, which aimed to show scientists that sets of data with near-identical summary statistics can map out quite differently once visualized. The lesson for any data investigation: take time to do quick visualizations, which are not necessarily published. These are used to form an understanding, much like taking a lot of shots with a camera to "see" what the situation is like. Only over time will you come to the one view, the one "photo" of the data, that is revealing and eye-opening.
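Anscombe's point can be checked directly. Below are two of the quartet's y-series (the published values): to a couple of decimal places they share the same mean and variance, yet one is a noisy line and the other a clean parabola when plotted.

```python
import statistics

# Two of the four Anscombe y-series: nearly identical summary statistics,
# completely different shapes when visualized.
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

print(round(statistics.mean(y1), 2), round(statistics.mean(y2), 2))  # → 7.5 7.5
# Sample variances also agree to one decimal place (both ≈ 4.1).
print(round(statistics.variance(y1), 1), round(statistics.variance(y2), 1))
```

The table of numbers gives no hint of this; only a quick plot does, which is exactly the argument for exploratory visualization.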
As a late add-on, since these were not listed above: as tips for collecting and cleaning data, here are two additional tools and one dataset to train yourself on data exploration.
Open Refine for cleaning
Open Refine, formerly Google Refine, is a powerful, free piece of software which lets you do certain cleaning tasks much quicker than with usual spreadsheet programs. For example, you might have 100,000-plus records where people's names are all in one column (Mr. First Last). With Open Refine it is quite simple to split those into three columns in a few steps. The tool needs a bit of tinkering to understand how it works, but we found that with five or six workflows you use all the time, a lot can be achieved.
Another great use of Open Refine is cleaning up different names for the same thing. Example: you have a survey where people used a lot of different names for the same location (NYC, New York City, New York, Big Apple). With Refine you can select the specific column and then make sure that only one description is used, which is often very helpful. In Excel you could use "search & replace", but this takes much more manual work, specifically when the dataset is big and very messy.
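For readers who want to see the logic behind these two Refine tasks, here is a minimal Python sketch of both: splitting "Title First Last" into columns, and collapsing location aliases onto one canonical spelling. The rows and the alias table are made up for illustration; Refine's clustering builds such a table interactively rather than requiring you to write it by hand.

```python
# Hypothetical rows as they might arrive: full name in one column,
# the same location spelled several different ways.
rows = [
    {"name": "Mr. John Smith", "location": "NYC"},
    {"name": "Ms. Amina Diallo", "location": "New York City"},
    {"name": "Mr. Lee Park", "location": "Big Apple"},
]

def split_name(full):
    """Split 'Title First Last' into three columns."""
    title, first, last = full.split(" ", 2)
    return {"title": title, "first": first, "last": last}

# Map known variants onto one canonical spelling.
CANONICAL = {"NYC": "New York City", "New York": "New York City",
             "Big Apple": "New York City", "New York City": "New York City"}

cleaned = [{**split_name(r["name"]),
            "location": CANONICAL.get(r["location"], r["location"])}
           for r in rows]

print(cleaned[0])
# → {'title': 'Mr.', 'first': 'John', 'last': 'Smith', 'location': 'New York City'}
```

Unknown locations fall through unchanged (the `.get` default), so nothing is silently renamed that the alias table doesn't cover.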
Statwing for exploration
For the simplest data comparison tool, sparing you the need to dig through complex instructions, have a look at Statwing. This is a commercial product, but it has a free test option and quite fair policies for occasional use, including keeping your data should you downgrade. Statwing is pretty much like SPSS or R (two packages for statistical analysis), but super easy to use. With a few clicks you can compare values, track statistical significance, etc. They have a few sample data sets.
Where do I find sample data to train myself?
There is one dataset out there that (a) is the data behind a relevant, memorable story and (b) is large enough to get you into trouble with simple spreadsheet software. Try searching for "titanic.csv", a comma-separated-value file containing the passenger list of the Titanic.
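A tiny sketch of the kind of first look you might take at such a file, using only Python's standard library. The three rows below stand in for the real titanic.csv; column names vary between copies of the file, so check your own header row first.

```python
import csv
import io

# A stand-in for the first lines of titanic.csv (column names assumed).
sample = io.StringIO(
    "Name,Sex,Survived\n"
    "Allen Miss Elisabeth,female,1\n"
    "Braund Mr Owen,male,0\n"
    "Heikkinen Miss Laina,female,1\n"
)

rows = list(csv.DictReader(sample))
survival_rate = sum(int(r["Survived"]) for r in rows) / len(rows)
print(len(rows), round(survival_rate, 2))  # → 3 0.67
```

For the real file you would pass `open("titanic.csv")` instead of the `StringIO` stand-in; everything else stays the same.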
If you want to try right away, go here:
Greetings - I just joined this conversation on an ad hoc basis today, and being quite new to this specific area of visualising information I feel I have already learned a lot by reading the previous posts.
My question is not directly related to previous posts but I wonder if some of you have some ideas/answers. One of the recurring questions on data collection/presentation in my job is: How do you engage in data collection from "alternative" (ie not online) information sources at community level, and then make this information relevant for decision making processes from the national to global level? I mean to focus not so much on verifying data (though that is another aspect) but on challenges with collecting information from personal sources on a regular basis, and "translating" this information into input at policymaking levels.
I work for an international network of locally based NGOs, and one of our main goals is to feed local perspectives on the prevention of violent conflicts into international strategies addressing them. One challenge we encounter is gathering relevant data from the ground, from information sources based in civil society, and presenting it to international stakeholders clearly and on a regular basis. We believe that the alternative information from the ground that our network members have is essential to the effective prevention of violent conflicts worldwide, but we struggle to collect and channel that information through efficiently. I believe some of the challenges are more related to the nature of networks (e.g. communication, different languages), but I would love to hear if there are tools that can be adapted to, or help navigate, this issue!
Gesa Bent, Global Partnership for the Prevention of Armed Conflict
Great questions, Gesa, and thanks for sharing your data challenges here. It makes me think that we should host an online conversation focused specifically on collecting data (there are so many aspects to it - like the use of ICTs, organizing data, cleaning data, securing data, and types of data collection like surveys).
Regarding data collection from community-level sources, the first thing that came to my mind is participatory research. We hosted an online conversation on this topic and we have a case study about how a participatory research process was carried out in Southeast Asia not only to document and understand how free trade was affecting small scale food producers in Malaysia, Philippines, Thailand, Vietnam, Indonesia, Burma, Cambodia and Laos but also as an effective means to inform and engage producers themselves in the process and issue. Being able to engage the community in the research process and building the capacity for this kind of research can be very powerful. I hope these are helpful!
There are also some examples of how ICTs are being used to collect information from communities in our conversation summary on Empowering communities with technology tools to protect children.
Regarding your question about how to make the information collected relevant for decision makers - I hope that question is addressed in this week's discussion topics:
I hope others chime in, too!
- Kristin Antin, New Tactics Online Community Builder
Here is another example: Using participatory research to advance children’s social and economic rights.
Wona Sanana was established in 1999 to protect children’s rights by compiling information on the condition of the children of Mozambique after the 16-year civil war. The project combined data collection on the welfare of children with community education to empower local people to take action and to promote improved policies addressing children’s rights. Through participatory research, communities learned about the problems facing their children and were encouraged to develop unique responses appropriate to the needs of their community.
I'll be listing here some examples of data sources:
One of my favorites, the Global Database of Events, Language, and Tone (GDELT) is a resource that could be great for contextualization.
These are notes from the November 14, 2012, Technology Salon NYC (TSNYC) where we discussed data visualization for immigration advocacy. There are several interesting points that Linda Raftree, the organizer, summarized in that post that may be of use.